Contents of /code/trunk/doc/pcrepattern.3

Revision 75
Sat Feb 24 21:40:37 2007 UTC, by nigel
File size: 61694 byte(s)
Load pcre-5.0 into code/trunk.

.TH PCRE 3
.SH NAME
PCRE - Perl-compatible regular expressions
.SH "PCRE REGULAR EXPRESSION DETAILS"
.rs
.sp
The syntax and semantics of the regular expressions supported by PCRE are
described below. Regular expressions are also described in the Perl
documentation and in a number of books, some of which have copious examples.
Jeffrey Friedl's "Mastering Regular Expressions", published by O'Reilly, covers
regular expressions in great detail. This description of PCRE's regular
expressions is intended as reference material.
.P
The original operation of PCRE was on strings of one-byte characters. However,
there is now also support for UTF-8 character strings. To use this, you must
build PCRE to include UTF-8 support, and then call \fBpcre_compile()\fP with
the PCRE_UTF8 option. How this affects pattern matching is mentioned in several
places below. There is also a summary of UTF-8 features in the
.\" HTML <a href="pcre.html#utf8support">
.\" </a>
section on UTF-8 support
.\"
in the main
.\" HREF
\fBpcre\fP
.\"
page.
.P
A regular expression is a pattern that is matched against a subject string from
left to right. Most characters stand for themselves in a pattern, and match the
corresponding characters in the subject. As a trivial example, the pattern
.sp
  The quick brown fox
.sp
matches a portion of a subject string that is identical to itself. The power of
regular expressions comes from the ability to include alternatives and
repetitions in the pattern. These are encoded in the pattern by the use of
\fImetacharacters\fP, which do not stand for themselves but instead are
interpreted in some special way.
.P
There are two different sets of metacharacters: those that are recognized
anywhere in the pattern except within square brackets, and those that are
recognized in square brackets. Outside square brackets, the metacharacters are
as follows:
.sp
  \e      general escape character with several uses
  ^      assert start of string (or line, in multiline mode)
  $      assert end of string (or line, in multiline mode)
  .      match any character except newline (by default)
  [      start character class definition
  |      start of alternative branch
  (      start subpattern
  )      end subpattern
  ?      extends the meaning of (
         also 0 or 1 quantifier
         also quantifier minimizer
  *      0 or more quantifier
  +      1 or more quantifier
         also "possessive quantifier"
  {      start min/max quantifier
.sp
Part of a pattern that is in square brackets is called a "character class". In
a character class the only metacharacters are:
.sp
  \e      general escape character
  ^      negate the class, but only if the first character
  -      indicates character range
.\" JOIN
  [      POSIX character class (only if followed by POSIX
           syntax)
  ]      terminates the character class
.sp
The following sections describe the use of each of the metacharacters.
.
.SH BACKSLASH
.rs
.sp
The backslash character has several uses. Firstly, if it is followed by a
non-alphanumeric character, it takes away any special meaning that character may
have. This use of backslash as an escape character applies both inside and
outside character classes.
.P
For example, if you want to match a * character, you write \e* in the pattern.
This escaping action applies whether or not the following character would
otherwise be interpreted as a metacharacter, so it is always safe to precede a
non-alphanumeric with backslash to specify that it stands for itself. In
particular, if you want to match a backslash, you write \e\e.
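.P
The escaping rule can be tried out with a short sketch. It uses Python's re
module as a stand-in for PCRE, which is an assumption made only so that the
example runs without a PCRE build; the helper name is hypothetical, and the two
engines are expected to agree on this particular rule.

```python
import re

def matches(pattern, subject):
    """Return True if the whole subject matches the pattern."""
    return re.fullmatch(pattern, subject) is not None

# A backslash before a non-alphanumeric character makes it literal.
print(matches(r"\*", "*"))    # escaped metacharacter
print(matches(r"\\", "\\"))   # a literal backslash is written as two
print(matches(r"\%", "%"))    # "%" is not special, escaping is still safe
```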
.P
If a pattern is compiled with the PCRE_EXTENDED option, whitespace in the
pattern (other than in a character class) and characters between a # outside
a character class and the next newline character are ignored. An escaping
backslash can be used to include a whitespace or # character as part of the
pattern.
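.P
A minimal sketch of extended-mode layout follows. It assumes that Python's
re.VERBOSE flag behaves like PCRE_EXTENDED for the features shown, and the
pattern names are illustrative only.

```python
import re

# Whitespace and "#" comments outside a character class are ignored.
date = re.compile(r"""
    (\d{4})   # year
    -         # separator
    (\d{2})   # month
""", re.VERBOSE)

# An escaping backslash keeps a space or "#" as part of the pattern.
tag = re.compile(r"issue\ \#(\d+)", re.VERBOSE)

print(date.fullmatch("2007-02").groups())
print(tag.fullmatch("issue #42").group(1))
```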
.P
If you want to remove the special meaning from a sequence of characters, you
can do so by putting them between \eQ and \eE. This is different from Perl in
that $ and @ are handled as literals in \eQ...\eE sequences in PCRE, whereas in
Perl, $ and @ cause variable interpolation. Note the following examples:
.sp
  Pattern            PCRE matches   Perl matches
.sp
.\" JOIN
  \eQabc$xyz\eE        abc$xyz        abc followed by the
                                      contents of $xyz
  \eQabc\e$xyz\eE       abc\e$xyz       abc\e$xyz
  \eQabc\eE\e$\eQxyz\eE   abc$xyz        abc$xyz
.sp
The \eQ...\eE sequence is recognized both inside and outside character classes.
.
.
.\" HTML <a name="digitsafterbackslash"></a>
.SS "Non-printing characters"
.rs
.sp
A second use of backslash provides a way of encoding non-printing characters
in patterns in a visible manner. There is no restriction on the appearance of
non-printing characters, apart from the binary zero that terminates a pattern,
but when a pattern is being prepared by text editing, it is usually easier to
use one of the following escape sequences than the binary character it
represents:
.sp
  \ea        alarm, that is, the BEL character (hex 07)
  \ecx       "control-x", where x is any character
  \ee        escape (hex 1B)
  \ef        formfeed (hex 0C)
  \en        newline (hex 0A)
  \er        carriage return (hex 0D)
  \et        tab (hex 09)
  \eddd      character with octal code ddd, or backreference
  \exhh      character with hex code hh
  \ex{hhh..} character with hex code hhh... (UTF-8 mode only)
.sp
The precise effect of \ecx is as follows: if x is a lower case letter, it
is converted to upper case. Then bit 6 of the character (hex 40) is inverted.
Thus \ecz becomes hex 1A, but \ec{ becomes hex 3B, while \ec; becomes hex
7B.
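.P
The arithmetic behind \ecx can be written down directly. The function below is
a hypothetical helper that only restates the rule above (convert a lower case
letter to upper case, then invert hex 40); it is not taken from the PCRE
sources.

```python
def control_char(x: str) -> int:
    """Return the character code that \\cx would produce for character x."""
    code = ord(x)
    if "a" <= x <= "z":
        code = ord(x.upper())   # lower case letters are upper-cased first
    return code ^ 0x40          # then bit 6 (hex 40) is inverted

for ch in "z{;":
    print(ch, hex(control_char(ch)))
```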
.P
After \ex, from zero to two hexadecimal digits are read (letters can be in
upper or lower case). In UTF-8 mode, any number of hexadecimal digits may
appear between \ex{ and }, but the value of the character code must be less
than 2**31 (that is, the maximum hexadecimal value is 7FFFFFFF). If characters
other than hexadecimal digits appear between \ex{ and }, or if there is no
terminating }, this form of escape is not recognized. Instead, the initial
\ex will be interpreted as a basic hexadecimal escape, with no following
digits, giving a character whose value is zero.
.P
Characters whose value is less than 256 can be defined by either of the two
syntaxes for \ex when PCRE is in UTF-8 mode. There is no difference in the
way they are handled. For example, \exdc is exactly the same as \ex{dc}.
.P
After \e0 up to two further octal digits are read. In both cases, if there
are fewer than two digits, just those that are present are used. Thus the
sequence \e0\ex\e07 specifies two binary zeros followed by a BEL character
(code value 7). Make sure you supply two digits after the initial zero if the
pattern character that follows is itself an octal digit.
.P
The handling of a backslash followed by a digit other than 0 is complicated.
Outside a character class, PCRE reads it and any following digits as a decimal
number. If the number is less than 10, or if there have been at least that many
previous capturing left parentheses in the expression, the entire sequence is
taken as a \fIback reference\fP. A description of how this works is given
.\" HTML <a href="#backreferences">
.\" </a>
later,
.\"
following the discussion of
.\" HTML <a href="#subpattern">
.\" </a>
parenthesized subpatterns.
.\"
.P
Inside a character class, or if the decimal number is greater than 9 and there
have not been that many capturing subpatterns, PCRE re-reads up to three octal
digits following the backslash, and generates a single byte from the least
significant 8 bits of the value. Any subsequent digits stand for themselves.
For example:
.sp
  \e040   is another way of writing a space
.\" JOIN
  \e40    is the same, provided there are fewer than 40
            previous capturing subpatterns
  \e7     is always a back reference
.\" JOIN
  \e11    might be a back reference, or another way of
            writing a tab
  \e011   is always a tab
  \e0113  is a tab followed by the character "3"
.\" JOIN
  \e113   might be a back reference, otherwise the
            character with octal code 113
.\" JOIN
  \e377   might be a back reference, otherwise
            the byte consisting entirely of 1 bits
.\" JOIN
  \e81    is either a back reference, or a binary zero
            followed by the two characters "8" and "1"
.sp
Note that octal values of 100 or greater must not be introduced by a leading
zero, because no more than three octal digits are ever read.
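.P
The decision rule for digits after a backslash can be sketched as a small
classifier. The function name, its arguments and its return format are
hypothetical; the code only restates the rules given in this section and is not
PCRE's actual implementation.

```python
OCTAL_DIGITS = "01234567"

def interpret_backslash_digits(digits, groups_so_far, in_class=False):
    """Classify the digit string that follows a backslash.

    Returns ("backref", n, "") or ("byte", value, literal_rest).
    """
    if not in_class and digits[0] != "0":
        number = int(digits)   # read all the digits as a decimal number
        if number < 10 or number <= groups_so_far:
            return ("backref", number, "")
    # Otherwise re-read up to three octal digits and keep the low 8 bits.
    count = 0
    while count < min(3, len(digits)) and digits[count] in OCTAL_DIGITS:
        count += 1
    value = (int(digits[:count], 8) & 0xFF) if count else 0
    return ("byte", value, digits[count:])

for text, groups in [("040", 0), ("40", 39), ("7", 0), ("0113", 0), ("81", 0)]:
    print(text, groups, interpret_backslash_digits(text, groups))
```

The third element of a "byte" result holds the digits that are left over and
stand for themselves.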
.P
All the sequences that define a single byte value or a single UTF-8 character
(in UTF-8 mode) can be used both inside and outside character classes. In
addition, inside a character class, the sequence \eb is interpreted as the
backspace character (hex 08), and the sequence \eX is interpreted as the
character "X". Outside a character class, these sequences have different
meanings
.\" HTML <a href="#uniextseq">
.\" </a>
(see below).
.\"
.
.
.SS "Generic character types"
.rs
.sp
The third use of backslash is for specifying generic character types. The
following are always recognized:
.sp
  \ed     any decimal digit
  \eD     any character that is not a decimal digit
  \es     any whitespace character
  \eS     any character that is not a whitespace character
  \ew     any "word" character
  \eW     any "non-word" character
.sp
Each pair of escape sequences partitions the complete set of characters into
two disjoint sets. Any given character matches one, and only one, of each pair.
.P
These character type sequences can appear both inside and outside character
classes. They each match one character of the appropriate type. If the current
matching point is at the end of the subject string, all of them fail, since
there is no character to match.
.P
For compatibility with Perl, \es does not match the VT character (code 11).
This makes it different from the POSIX "space" class. The \es characters
are HT (9), LF (10), FF (12), CR (13), and space (32).
.P
A "word" character is an underscore or any character less than 256 that is a
letter or digit. The definition of letters and digits is controlled by PCRE's
low-valued character tables, and may vary if locale-specific matching is taking
place (see
.\" HTML <a href="pcreapi.html#localesupport">
.\" </a>
"Locale support"
.\"
in the
.\" HREF
\fBpcreapi\fP
.\"
page). For example, in the "fr_FR" (French) locale, some character codes
greater than 128 are used for accented letters, and these are matched by \ew.
.P
In UTF-8 mode, characters with values greater than 128 never match \ed, \es, or
\ew, and always match \eD, \eS, and \eW. This is true even when Unicode
character property support is available.
.
.
.\" HTML <a name="uniextseq"></a>
.SS Unicode character properties
.rs
.sp
When PCRE is built with Unicode character property support, three additional
escape sequences to match generic character types are available when UTF-8 mode
is selected. They are:
.sp
  \ep{\fIxx\fP}   a character with the \fIxx\fP property
  \eP{\fIxx\fP}   a character without the \fIxx\fP property
  \eX       an extended Unicode sequence
.sp
The property names represented by \fIxx\fP above are limited to the
Unicode general category properties. Each character has exactly one such
property, specified by a two-letter abbreviation. For compatibility with Perl,
negation can be specified by including a circumflex between the opening brace
and the property name. For example, \ep{^Lu} is the same as \eP{Lu}.
.P
If only one letter is specified with \ep or \eP, it includes all the properties
that start with that letter. In this case, in the absence of negation, the
curly brackets in the escape sequence are optional; these two examples have
the same effect:
.sp
  \ep{L}
  \epL
.sp
The following property codes are supported:
.sp
  C     Other
  Cc    Control
  Cf    Format
  Cn    Unassigned
  Co    Private use
  Cs    Surrogate
.sp
  L     Letter
  Ll    Lower case letter
  Lm    Modifier letter
  Lo    Other letter
  Lt    Title case letter
  Lu    Upper case letter
.sp
  M     Mark
  Mc    Spacing mark
  Me    Enclosing mark
  Mn    Non-spacing mark
.sp
  N     Number
  Nd    Decimal number
  Nl    Letter number
  No    Other number
.sp
  P     Punctuation
  Pc    Connector punctuation
  Pd    Dash punctuation
  Pe    Close punctuation
  Pf    Final punctuation
  Pi    Initial punctuation
  Po    Other punctuation
  Ps    Open punctuation
.sp
  S     Symbol
  Sc    Currency symbol
  Sk    Modifier symbol
  Sm    Mathematical symbol
  So    Other symbol
.sp
  Z     Separator
  Zl    Line separator
  Zp    Paragraph separator
  Zs    Space separator
.sp
Extended properties such as "Greek" or "InMusicalSymbols" are not supported by
PCRE.
.P
Specifying caseless matching does not affect these escape sequences. For
example, \ep{Lu} always matches only upper case letters.
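.P
The general category test can be approximated with Python's unicodedata module.
This is a sketch under the assumption that its category data is a fair stand-in
for PCRE's own tables; has_property is a hypothetical helper, not a PCRE
function.

```python
import unicodedata

def has_property(ch, prop):
    """Emulate \\p{prop} for one character.

    A single letter such as "L" matches every category that starts with it,
    and a leading "^" negates the test, as in \\p{^Lu}.
    """
    if prop.startswith("^"):
        return not has_property(ch, prop[1:])
    return unicodedata.category(ch).startswith(prop)

print(has_property("A", "Lu"))    # upper case letter
print(has_property("a", "L"))     # any kind of letter
print(has_property("a", "^Lu"))   # not an upper case letter
```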
.P
The \eX escape matches any number of Unicode characters that form an extended
Unicode sequence. \eX is equivalent to
.sp
  (?>\ePM\epM*)
.sp
That is, it matches a character without the "mark" property, followed by zero
or more characters with the "mark" property, and treats the sequence as an
atomic group
.\" HTML <a href="#atomicgroup">
.\" </a>
(see below).
.\"
Characters with the "mark" property are typically accents that affect the
preceding character.
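.P
The grouping that \eX performs can be imitated as follows. This is only a
sketch: the function names are hypothetical, Python's unicodedata categories
stand in for PCRE's property tables, and a mark with no base character before
it (which \eX would not match) is simply returned on its own.

```python
import unicodedata

def is_mark(ch):
    """True for characters whose general category starts with M."""
    return unicodedata.category(ch).startswith("M")

def extended_sequences(text):
    """Split text into a base character plus any marks that follow it."""
    sequences = []
    for ch in text:
        if is_mark(ch) and sequences:
            sequences[-1] += ch      # attach the mark to the preceding base
        else:
            sequences.append(ch)     # start a new sequence
    return sequences

# "e" followed by a combining acute accent forms one sequence.
print([len(s) for s in extended_sequences("e\u0301a")])
```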
.P
Matching characters by Unicode property is not fast, because PCRE has to search
a structure that contains data for over fifteen thousand characters. That is
why the traditional escape sequences such as \ed and \ew do not use Unicode
properties in PCRE.
.
.
.\" HTML <a name="smallassertions"></a>
.SS "Simple assertions"
.rs
.sp
The fourth use of backslash is for certain simple assertions. An assertion
specifies a condition that has to be met at a particular point in a match,
without consuming any characters from the subject string. The use of
subpatterns for more complicated assertions is described
.\" HTML <a href="#bigassertions">
.\" </a>
below.
.\"
The backslashed
assertions are:
.sp
  \eb     matches at a word boundary
  \eB     matches when not at a word boundary
  \eA     matches at start of subject
  \eZ     matches at end of subject or before newline at end
  \ez     matches at end of subject
  \eG     matches at first matching position in subject
.sp
These assertions may not appear in character classes (but note that \eb has a
different meaning, namely the backspace character, inside a character class).
.P
A word boundary is a position in the subject string where the current character
and the previous character do not both match \ew or \eW (i.e. one matches
\ew and the other matches \eW), or the start or end of the string if the
first or last character matches \ew, respectively.
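.P
The word boundary definition translates into a short position test. The helper
below is hypothetical and uses Python's re module for the \ew check, on the
assumption that it agrees with PCRE for plain ASCII subjects.

```python
import re

def is_word_boundary(subject, pos):
    """True if the position just before subject[pos] is a word boundary."""
    def is_word(i):
        # Positions outside the string count as "not a word character".
        return 0 <= i < len(subject) and re.fullmatch(r"\w", subject[i]) is not None
    return is_word(pos - 1) != is_word(pos)

subject = "hi, you"
print([p for p in range(len(subject) + 1) if is_word_boundary(subject, p)])
```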
.P
The \eA, \eZ, and \ez assertions differ from the traditional circumflex and
dollar (described in the next section) in that they only ever match at the very
start and end of the subject string, whatever options are set. Thus, they are
independent of multiline mode. These three assertions are not affected by the
PCRE_NOTBOL or PCRE_NOTEOL options, which affect only the behaviour of the
circumflex and dollar metacharacters. However, if the \fIstartoffset\fP
argument of \fBpcre_exec()\fP is non-zero, indicating that matching is to start
at a point other than the beginning of the subject, \eA can never match. The
difference between \eZ and \ez is that \eZ matches before a newline that is the
last character of the string as well as at the end of the string, whereas \ez
matches only at the end.
.P
The \eG assertion is true only when the current matching position is at the
start point of the match, as specified by the \fIstartoffset\fP argument of
\fBpcre_exec()\fP. It differs from \eA when the value of \fIstartoffset\fP is
non-zero. By calling \fBpcre_exec()\fP multiple times with appropriate
arguments, you can mimic Perl's /g option, and it is in this kind of
implementation where \eG can be useful.
.P
Note, however, that PCRE's interpretation of \eG, as the start of the current
match, is subtly different from Perl's, which defines it as the end of the
previous match. In Perl, these can be different when the previously matched
string was empty. Because PCRE does just one match at a time, it cannot
reproduce this behaviour.
.P
If all the alternatives of a pattern begin with \eG, the expression is anchored
to the starting match position, and the "anchored" flag is set in the compiled
regular expression.
.
.
.SH "CIRCUMFLEX AND DOLLAR"
.rs
.sp
Outside a character class, in the default matching mode, the circumflex
character is an assertion that is true only if the current matching point is
at the start of the subject string. If the \fIstartoffset\fP argument of
\fBpcre_exec()\fP is non-zero, circumflex can never match if the PCRE_MULTILINE
option is unset. Inside a character class, circumflex has an entirely different
meaning
.\" HTML <a href="#characterclass">
.\" </a>
(see below).
.\"
.P
Circumflex need not be the first character of the pattern if a number of
alternatives are involved, but it should be the first thing in each alternative
in which it appears if the pattern is ever to match that branch. If all
possible alternatives start with a circumflex, that is, if the pattern is
constrained to match only at the start of the subject, it is said to be an
"anchored" pattern. (There are also other constructs that can cause a pattern
to be anchored.)
.P
A dollar character is an assertion that is true only if the current matching
point is at the end of the subject string, or immediately before a newline
character that is the last character in the string (by default). Dollar need
not be the last character of the pattern if a number of alternatives are
involved, but it should be the last item in any branch in which it appears.
Dollar has no special meaning in a character class.
.P
The meaning of dollar can be changed so that it matches only at the very end of
the string, by setting the PCRE_DOLLAR_ENDONLY option at compile time. This
does not affect the \eZ assertion.
.P
The meanings of the circumflex and dollar characters are changed if the
PCRE_MULTILINE option is set. When this is the case, they match immediately
after and immediately before an internal newline character, respectively, in
addition to matching at the start and end of the subject string. For example,
the pattern /^abc$/ matches the subject string "def\enabc" (where \en
represents a newline character) in multiline mode, but not otherwise.
Consequently, patterns that are anchored in single line mode because all
branches start with ^ are not anchored in multiline mode, and a match for
circumflex is possible when the \fIstartoffset\fP argument of \fBpcre_exec()\fP
is non-zero. The PCRE_DOLLAR_ENDONLY option is ignored if PCRE_MULTILINE is
set.
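.P
The multiline example can be reproduced with a few lines of Python. The re
module is used here as an assumed stand-in for PCRE, with re.MULTILINE playing
the part of PCRE_MULTILINE; the variable names are illustrative.

```python
import re

subject = "def\nabc"

# Default mode: ^ matches only at the very start, so there is no match.
single_line = re.search(r"^abc$", subject)

# Multiline mode: ^ also matches just after the internal newline.
multi_line = re.search(r"^abc$", subject, re.MULTILINE)

# By default, $ also matches just before a newline that ends the subject.
before_final_newline = re.search(r"abc$", "abc\n")

print(single_line, multi_line.start(), before_final_newline is not None)
```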
.P
Note that the sequences \eA, \eZ, and \ez can be used to match the start and
end of the subject in both modes, and if all branches of a pattern start with
\eA it is always anchored, whether PCRE_MULTILINE is set or not.
.
.
.SH "FULL STOP (PERIOD, DOT)"
.rs
.sp
Outside a character class, a dot in the pattern matches any one character in
the subject, including a non-printing character, but not (by default) newline.
In UTF-8 mode, a dot matches any UTF-8 character, which might be more than one
byte long, except (by default) newline. If the PCRE_DOTALL option is set,
dots match newlines as well. The handling of dot is entirely independent of the
handling of circumflex and dollar, the only relationship being that they both
involve newline characters. Dot has no special meaning in a character class.
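.P
A brief sketch of the dot rules, again assuming that Python's re module and
its re.DOTALL flag mirror PCRE and PCRE_DOTALL for these simple cases:

```python
import re

# By default a dot does not match a newline.
default_newline = re.fullmatch(r"a.c", "a\nc")

# With DOTALL set, it does.
dotall_newline = re.fullmatch(r"a.c", "a\nc", re.DOTALL)

# A non-printing character other than newline is matched in either mode.
non_printing = re.fullmatch(r"a.c", "a\x00c")

print(default_newline is None, dotall_newline is not None, non_printing is not None)
```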
.
.
.SH "MATCHING A SINGLE BYTE"
.rs
.sp
Outside a character class, the escape sequence \eC matches any one byte, both
in and out of UTF-8 mode. Unlike a dot, it can match a newline. The feature is
provided in Perl in order to match individual bytes in UTF-8 mode. Because it
breaks up UTF-8 characters into individual bytes, what remains in the string
may be a malformed UTF-8 string. For this reason, the \eC escape sequence is
best avoided.
.P
PCRE does not allow \eC to appear in lookbehind assertions
.\" HTML <a href="#lookbehind">
.\" </a>
(described below),
.\"
because in UTF-8 mode this would make it impossible to calculate the length of
the lookbehind.
.
.
.\" HTML <a name="characterclass"></a>
.SH "SQUARE BRACKETS AND CHARACTER CLASSES"
.rs
.sp
An opening square bracket introduces a character class, terminated by a closing
square bracket. A closing square bracket on its own is not special. If a
closing square bracket is required as a member of the class, it should be the
first data character in the class (after an initial circumflex, if present) or
escaped with a backslash.
.P
A character class matches a single character in the subject. In UTF-8 mode, the
character may occupy more than one byte. A matched character must be in the set
of characters defined by the class, unless the first character in the class
definition is a circumflex, in which case the subject character must not be in
the set defined by the class. If a circumflex is actually required as a member
of the class, ensure it is not the first character, or escape it with a
backslash.
.P
For example, the character class [aeiou] matches any lower case vowel, while
[^aeiou] matches any character that is not a lower case vowel. Note that a
circumflex is just a convenient notation for specifying the characters that
are in the class by enumerating those that are not. A class that starts with a
circumflex is not an assertion: it still consumes a character from the subject
string, and therefore it fails if the current pointer is at the end of the
string.
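.P
The point that a negated class still consumes a character is easy to check.
The sketch below assumes that Python's re module treats these classes the same
way as PCRE; the final line uses a negative lookahead, a true assertion, purely
as a contrast.

```python
import re

vowel = re.fullmatch(r"[aeiou]", "e")
non_vowel = re.fullmatch(r"[^aeiou]", "x")

# At the end of the subject there is nothing left for [^b] to consume.
at_end = re.search(r"a[^b]", "a")

# An assertion consumes nothing, so it can succeed at the end of the subject.
assertion = re.search(r"a(?!b)", "a")

print(vowel is not None, non_vowel is not None, at_end, assertion.group())
```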
.P
In UTF-8 mode, characters with values greater than 255 can be included in a
class as a literal string of bytes, or by using the \ex{ escaping mechanism.
.P
When caseless matching is set, any letters in a class represent both their
upper case and lower case versions, so for example, a caseless [aeiou] matches
"A" as well as "a", and a caseless [^aeiou] does not match "A", whereas a
caseful version would. When running in UTF-8 mode, PCRE supports the concept of
case for characters with values greater than 128 only when it is compiled with
Unicode property support.
.P
The newline character is never treated in any special way in character classes,
whatever the setting of the PCRE_DOTALL or PCRE_MULTILINE options is. A class
such as [^a] will always match a newline.
.P
The minus (hyphen) character can be used to specify a range of characters in a
character class. For example, [d-m] matches any letter between d and m,
inclusive. If a minus character is required in a class, it must be escaped with
a backslash or appear in a position where it cannot be interpreted as
indicating a range, typically as the first or last character in the class.
.P
It is not possible to have the literal character "]" as the end character of a
range. A pattern such as [W-]46] is interpreted as a class of two characters
("W" and "-") followed by a literal string "46]", so it would match "W46]" or
"-46]". However, if the "]" is escaped with a backslash it is interpreted as
the end of range, so [W-\e]46] is interpreted as a class containing a range
followed by two other characters. The octal or hexadecimal representation of
"]" can also be used to end a range.
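.P
The two readings of the closing bracket can be compared side by side. Python's
re module is used as an assumed stand-in for PCRE here, and the variable names
are illustrative.

```python
import re

# Class of "W" and "-", followed by the literal string "46]".
two_chars_then_literal = re.compile(r"[W-]46]")

# Range from "W" to "]", plus the two characters "4" and "6".
range_plus_two = re.compile(r"[W-\]46]")

print(bool(two_chars_then_literal.fullmatch("W46]")))
print(bool(two_chars_then_literal.fullmatch("-46]")))
print(bool(range_plus_two.fullmatch("Z")))   # "Z" lies between "W" and "]"
```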
.P
Ranges operate in the collating sequence of character values. They can also be
used for characters specified numerically, for example [\e000-\e037]. In UTF-8
mode, ranges can include characters whose values are greater than 255, for
example [\ex{100}-\ex{2ff}].
.P
If a range that includes letters is used when caseless matching is set, it
matches the letters in either case. For example, [W-c] is equivalent to
[][\e\e^_`wxyzabc], matched caselessly, and in non-UTF-8 mode, if character
tables for the "fr_FR" locale are in use, [\exc8-\excb] matches accented E
characters in both cases. In UTF-8 mode, PCRE supports the concept of case for
characters with values greater than 128 only when it is compiled with Unicode
property support.
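.P
The caseless range [W-c] can be explored by listing which candidate characters
it accepts. The sketch assumes that Python's re.IGNORECASE flag handles ASCII
ranges in the same way as PCRE's caseless matching.

```python
import re

caseless = re.compile(r"[W-c]", re.IGNORECASE)

# Candidates on both sides of the range boundaries, in both cases.
candidates = "VWXYZ[\\]^_`abcdvwxyz"
matched = "".join(c for c in candidates if caseless.fullmatch(c))

print(matched)
```

Only "V", "d" and "v" from the candidate list fall outside the range in both
cases.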
.P
The character types \ed, \eD, \ep, \eP, \es, \eS, \ew, and \eW may also appear
in a character class, and add the characters that they match to the class. For
example, [\edABCDEF] matches any hexadecimal digit. A circumflex can
conveniently be used with the upper case character types to specify a more
restricted set of characters than the matching lower case type. For example,
the class [^\eW_] matches any letter or digit, but not underscore.
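.P
Both classes can be exercised in a short sketch. It assumes Python's re module
with the re.ASCII flag, so that \ed and \ew are limited to ASCII in roughly the
way PCRE's default tables are; the variable names are illustrative.

```python
import re

# Digits plus the upper case letters A to F (lower case "f" is not included
# unless caseless matching is also requested).
hex_upper = re.compile(r"[\dABCDEF]", re.ASCII)

# Everything that is a word character, except underscore.
letter_or_digit = re.compile(r"[^\W_]", re.ASCII)

print(bool(hex_upper.fullmatch("F")), bool(hex_upper.fullmatch("f")))
print(bool(letter_or_digit.fullmatch("a")), bool(letter_or_digit.fullmatch("_")))
```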
.P
The only metacharacters that are recognized in character classes are backslash,
hyphen (only where it can be interpreted as specifying a range), circumflex
(only at the start), opening square bracket (only when it can be interpreted as
introducing a POSIX class name - see the next section), and the terminating
closing square bracket. However, escaping other non-alphanumeric characters
does no harm.
.
.
.SH "POSIX CHARACTER CLASSES"
.rs
.sp
Perl supports the POSIX notation for character classes. This uses names
enclosed by [: and :] within the enclosing square brackets. PCRE also supports
this notation. For example,
.sp
  [01[:alpha:]%]
.sp
matches "0", "1", any alphabetic character, or "%". The supported class names
are
.sp
  alnum    letters and digits
  alpha    letters
  ascii    character codes 0 - 127
  blank    space or tab only
  cntrl    control characters
  digit    decimal digits (same as \ed)
  graph    printing characters, excluding space
  lower    lower case letters
  print    printing characters, including space
  punct    printing characters, excluding letters and digits
  space    white space (not quite the same as \es)
  upper    upper case letters
  word     "word" characters (same as \ew)
  xdigit   hexadecimal digits
.sp
The "space" characters are HT (9), LF (10), VT (11), FF (12), CR (13), and
space (32). Notice that this list includes the VT character (code 11). This
makes "space" different to \es, which does not include VT (for Perl
compatibility).
.P
The name "word" is a Perl extension, and "blank" is a GNU extension from Perl
5.8. Another Perl extension is negation, which is indicated by a ^ character
after the colon. For example,
.sp
  [12[:^digit:]]
.sp
matches "1", "2", or any non-digit. PCRE (and Perl) also recognize the POSIX
syntax [.ch.] and [=ch=] where "ch" is a "collating element", but these are not
supported, and an error is given if they are encountered.
.P
In UTF-8 mode, characters with values greater than 128 do not match any of
the POSIX character classes.
.
.
.SH "VERTICAL BAR"
.rs
.sp
Vertical bar characters are used to separate alternative patterns. For example,
the pattern
.sp
  gilbert|sullivan
.sp
matches either "gilbert" or "sullivan". Any number of alternatives may appear,
and an empty alternative is permitted (matching the empty string).
The matching process tries each alternative in turn, from left to right,
and the first one that succeeds is used. If the alternatives are within a
subpattern
.\" HTML <a href="#subpattern">
.\" </a>
(defined below),
.\"
"succeeds" means matching the rest of the main pattern as well as the
alternative in the subpattern.
645     .
646     .
647     .SH "INTERNAL OPTION SETTING"
648 nigel 63 .rs
649     .sp
650     The settings of the PCRE_CASELESS, PCRE_MULTILINE, PCRE_DOTALL, and
651     PCRE_EXTENDED options can be changed from within the pattern by a sequence of
652     Perl option letters enclosed between "(?" and ")". The option letters are
653 nigel 75 .sp
654 nigel 63 i for PCRE_CASELESS
655     m for PCRE_MULTILINE
656     s for PCRE_DOTALL
657     x for PCRE_EXTENDED
658 nigel 75 .sp
659 nigel 63 For example, (?im) sets caseless, multiline matching. It is also possible to
660     unset these options by preceding the letter with a hyphen, and a combined
661     setting and unsetting such as (?im-sx), which sets PCRE_CASELESS and
662     PCRE_MULTILINE while unsetting PCRE_DOTALL and PCRE_EXTENDED, is also
663     permitted. If a letter appears both before and after the hyphen, the option is
664     unset.
665 nigel 75 .P
666 nigel 63 When an option change occurs at top level (that is, not inside subpattern
667     parentheses), the change applies to the remainder of the pattern that follows.
668     If the change is placed right at the start of a pattern, PCRE extracts it into
669     the global options (and it will therefore show up in data extracted by the
670 nigel 75 \fBpcre_fullinfo()\fP function).
671     .P
672 nigel 63 An option change within a subpattern affects only that part of the current
673     pattern that follows it, so
674 nigel 75 .sp
675 nigel 63 (a(?i)b)c
676 nigel 75 .sp
677 nigel 63 matches abc and aBc and no other strings (assuming PCRE_CASELESS is not used).
678     By this means, options can be made to have different settings in different
679     parts of the pattern. Any changes made in one alternative do carry on
680     into subsequent branches within the same subpattern. For example,
681 nigel 75 .sp
682 nigel 63 (a(?i)b|c)
683 nigel 75 .sp
684 nigel 63 matches "ab", "aB", "c", and "C", even though when matching "C" the first
685     branch is abandoned before the option setting. This is because the effects of
686     option settings happen at compile time. There would be some very weird
687     behaviour otherwise.
688 nigel 75 .P
689 nigel 63 The PCRE-specific options PCRE_UNGREEDY and PCRE_EXTRA can be changed in the
690     same way as the Perl-compatible options by using the characters U and X
691     respectively. The (?X) flag setting is special in that it must always occur
692     earlier in the pattern than any of the additional features it turns on, even
693 nigel 75 when it is at top level. It is best to put it at the start.
694     .
695     .
696     .\" HTML <a name="subpattern"></a>
697 nigel 63 .SH SUBPATTERNS
698     .rs
699     .sp
700     Subpatterns are delimited by parentheses (round brackets), which can be nested.
701 nigel 75 Turning part of a pattern into a subpattern does two things:
702     .sp
703 nigel 63 1. It localizes a set of alternatives. For example, the pattern
704 nigel 75 .sp
705 nigel 63 cat(aract|erpillar|)
706 nigel 75 .sp
707 nigel 63 matches one of the words "cat", "cataract", or "caterpillar". Without the
708     parentheses, it would match "cataract", "erpillar" or the empty string.
709 nigel 75 .sp
710     2. It sets up the subpattern as a capturing subpattern. This means that, when
711     the whole pattern matches, that portion of the subject string that matched the
712     subpattern is passed back to the caller via the \fIovector\fP argument of
713     \fBpcre_exec()\fP. Opening parentheses are counted from left to right (starting
714     from 1) to obtain numbers for the capturing subpatterns.
715     .P
716 nigel 63 For example, if the string "the red king" is matched against the pattern
717 nigel 75 .sp
718 nigel 63 the ((red|white) (king|queen))
719 nigel 75 .sp
720 nigel 63 the captured substrings are "red king", "red", and "king", and are numbered 1,
721     2, and 3, respectively.
722 nigel 75 .P
723 nigel 63 The fact that plain parentheses fulfil two functions is not always helpful.
724     There are often times when a grouping subpattern is required without a
725     capturing requirement. If an opening parenthesis is followed by a question mark
726     and a colon, the subpattern does not do any capturing, and is not counted when
727     computing the number of any subsequent capturing subpatterns. For example, if
728     the string "the white queen" is matched against the pattern
729 nigel 75 .sp
730 nigel 63 the ((?:red|white) (king|queen))
731 nigel 75 .sp
732 nigel 63 the captured substrings are "white queen" and "queen", and are numbered 1 and
733     2. The maximum number of capturing subpatterns is 65535, and the maximum depth
734     of nesting of all subpatterns, both capturing and non-capturing, is 200.
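The numbering rules above can be checked with a short sketch. It uses Python's re module as a stand-in, which is an assumption of this example and not part of PCRE; for these two patterns the grouping syntax and the numbering are the same.

```python
import re

# Plain parentheses both group and capture. Groups are numbered by the
# position of their opening parenthesis, counting from 1.
capturing = re.search(r"the ((red|white) (king|queen))", "the red king")
capturing_groups = capturing.groups()

# (?: ... ) groups without capturing and is skipped when numbering.
non_capturing = re.search(r"the ((?:red|white) (king|queen))", "the white queen")
non_capturing_groups = non_capturing.groups()

print(capturing_groups)      # ('red king', 'red', 'king')
print(non_capturing_groups)  # ('white queen', 'queen')
```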
735 nigel 75 .P
736 nigel 63 As a convenient shorthand, if any option settings are required at the start of
737     a non-capturing subpattern, the option letters may appear between the "?" and
738     the ":". Thus the two patterns
739 nigel 75 .sp
740 nigel 63 (?i:saturday|sunday)
741     (?:(?i)saturday|sunday)
742 nigel 75 .sp
743 nigel 63 match exactly the same set of strings. Because alternative branches are tried
744     from left to right, and options are not reset until the end of the subpattern
745     is reached, an option setting in one branch does affect subsequent branches, so
746     the above patterns match "SUNDAY" as well as "Saturday".
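As a rough illustration of the shorthand form, the sketch below uses Python's re module (an assumption of this example, not PCRE itself). Only the (?i:...) form is shown, because recent Python versions reject an option setting such as (?i) that is not at the very start of a pattern.

```python
import re

# The option letter sits between "?" and ":" and applies to every
# branch of the non-capturing group.
weekend = re.compile(r"(?i:saturday|sunday)")

# "monday" is included only as a negative control.
results = {word: bool(weekend.fullmatch(word))
           for word in ("Saturday", "SUNDAY", "monday")}

print(results)  # {'Saturday': True, 'SUNDAY': True, 'monday': False}
```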
747 nigel 75 .
748     .
749     .SH "NAMED SUBPATTERNS"
750 nigel 63 .rs
751     .sp
752     Identifying capturing parentheses by number is simple, but it can be very hard
753     to keep track of the numbers in complicated regular expressions. Furthermore,
754 nigel 75 if an expression is modified, the numbers may change. To help with this
755 nigel 63 difficulty, PCRE supports the naming of subpatterns, something that Perl does
756     not provide. The Python syntax (?P<name>...) is used. Names consist of
757     alphanumeric characters and underscores, and must be unique within a pattern.
758 nigel 75 .P
759 nigel 63 Named capturing parentheses are still allocated numbers as well as names. The
760     PCRE API provides function calls for extracting the name-to-number translation
761 nigel 75 table from a compiled pattern. There is also a convenience function for
762     extracting a captured substring by name. For further details see the
763 nigel 63 .\" HREF
764 nigel 75 \fBpcreapi\fP
765 nigel 63 .\"
766     documentation.
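The (?P<name>...) syntax originated in Python, so a small sketch with Python's re module can show how names and numbers coexist. The pattern and the names "year" and "month" are hypothetical, chosen for illustration; the PCRE functions for the name table are described in pcreapi and are not shown here.

```python
import re

date = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})")
match = date.search("released 2004-09")

# Named groups still receive ordinary numbers, left to right.
name_to_number = dict(date.groupindex)

# The same captured text is reachable by name or by number.
by_name = match.group("year")
by_number = match.group(1)

print(name_to_number)      # {'year': 1, 'month': 2}
print(by_name, by_number)  # 2004 2004
```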
767 nigel 75 .
768     .
769 nigel 63 .SH REPETITION
770     .rs
771     .sp
772     Repetition is specified by quantifiers, which can follow any of the following
773     items:
774 nigel 75 .sp
775 nigel 63 a literal data character
776     the . metacharacter
777 nigel 75 the \eC escape sequence
778     the \eX escape sequence (in UTF-8 mode with Unicode properties)
779     an escape such as \ed that matches a single character
780 nigel 63 a character class
781     a back reference (see next section)
782     a parenthesized subpattern (unless it is an assertion)
783 nigel 75 .sp
784 nigel 63 The general repetition quantifier specifies a minimum and maximum number of
785     permitted matches, by giving the two numbers in curly brackets (braces),
786     separated by a comma. The numbers must be less than 65536, and the first must
787     be less than or equal to the second. For example:
788 nigel 75 .sp
789 nigel 63 z{2,4}
790 nigel 75 .sp
791 nigel 63 matches "zz", "zzz", or "zzzz". A closing brace on its own is not a special
792     character. If the second number is omitted, but the comma is present, there is
793     no upper limit; if the second number and the comma are both omitted, the
794     quantifier specifies an exact number of required matches. Thus
795 nigel 75 .sp
796 nigel 63 [aeiou]{3,}
797 nigel 75 .sp
798 nigel 63 matches at least 3 successive vowels, but may match many more, while
799 nigel 75 .sp
800     \ed{8}
801     .sp
802 nigel 63 matches exactly 8 digits. An opening curly bracket that appears in a position
803     where a quantifier is not allowed, or one that does not match the syntax of a
804     quantifier, is taken as a literal character. For example, {,6} is not a
805     quantifier, but a literal string of four characters.
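The three quantifier forms above can be tried with the following sketch, which assumes Python's re module as a stand-in for PCRE. The {,6} case is deliberately left out, because Python treats it as a quantifier and so differs from the behaviour described here.

```python
import re

# {2,4}: between two and four repetitions.
z_ok = [s for s in ("z", "zz", "zzz", "zzzz", "zzzzz")
        if re.fullmatch(r"z{2,4}", s)]

# {3,}: at least three, with no upper limit, so the greedy match
# takes all five successive vowels.
vowel_run = re.search(r"[aeiou]{3,}", "queueing").group()

# {8}: exactly eight repetitions.
eight_digits = bool(re.fullmatch(r"\d{8}", "20040913"))
seven_digits = bool(re.fullmatch(r"\d{8}", "2004091"))

print(z_ok)       # ['zz', 'zzz', 'zzzz']
print(vowel_run)  # ueuei
```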
806 nigel 75 .P
807 nigel 63 In UTF-8 mode, quantifiers apply to UTF-8 characters rather than to individual
808 nigel 75 bytes. Thus, for example, \ex{100}{2} matches two UTF-8 characters, each of
809     which is represented by a two-byte sequence. Similarly, when Unicode property
810     support is available, \eX{3} matches three Unicode extended sequences, each of
811     which may be several bytes long (and they may be of different lengths).
812     .P
813 nigel 63 The quantifier {0} is permitted, causing the expression to behave as if the
814     previous item and the quantifier were not present.
815 nigel 75 .P
816 nigel 63 For convenience (and historical compatibility) the three most common
817     quantifiers have single-character abbreviations:
818 nigel 75 .sp
819 nigel 63 * is equivalent to {0,}
820     + is equivalent to {1,}
821     ? is equivalent to {0,1}
822 nigel 75 .sp
823 nigel 63 It is possible to construct infinite loops by following a subpattern that can
824     match no characters with a quantifier that has no upper limit, for example:
825 nigel 75 .sp
826 nigel 63 (a?)*
827 nigel 75 .sp
828 nigel 63 Earlier versions of Perl and PCRE used to give an error at compile time for
829     such patterns. However, because there are cases where this can be useful, such
830     patterns are now accepted, but if any repetition of the subpattern does in fact
831     match no characters, the loop is forcibly broken.
832 nigel 75 .P
833 nigel 63 By default, the quantifiers are "greedy", that is, they match as much as
834     possible (up to the maximum number of permitted times), without causing the
835     rest of the pattern to fail. The classic example of where this gives problems
836 nigel 75 is in trying to match comments in C programs. These appear between /* and */
837     and within the comment, individual * and / characters may appear. An attempt to
838     match C comments by applying the pattern
839     .sp
840     /\e*.*\e*/
841     .sp
842 nigel 63 to the string
843 nigel 75 .sp
844     /* first comment */ not comment /* second comment */
845     .sp
846 nigel 63 fails, because it matches the entire string owing to the greediness of the .*
847     item.
848 nigel 75 .P
849 nigel 63 However, if a quantifier is followed by a question mark, it ceases to be
850     greedy, and instead matches the minimum number of times possible, so the
851     pattern
852 nigel 75 .sp
853     /\e*.*?\e*/
854     .sp
855 nigel 63 does the right thing with the C comments. The meaning of the various
856     quantifiers is not otherwise changed, just the preferred number of matches.
857     Do not confuse this use of question mark with its use as a quantifier in its
858     own right. Because it has two uses, it can sometimes appear doubled, as in
859 nigel 75 .sp
860     \ed??\ed
861     .sp
862 nigel 63 which matches one digit by preference, but can match two if that is the only
863     way the rest of the pattern matches.
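The difference between greedy and minimizing quantifiers can be seen with the C comment example. This is a minimal sketch using Python's re module, assumed here to behave like PCRE for these patterns.

```python
import re

subject = "/* first comment */ not comment /* second comment */"

# Greedy .* runs to the last "*/" in the subject.
greedy = re.search(r"/\*.*\*/", subject).group()

# Minimizing .*? stops at the first "*/".
lazy = re.search(r"/\*.*?\*/", subject).group()

# \d??\d prefers one digit, but takes two when the rest of the match
# (here, reaching the end of the string) requires it.
prefer_one = re.match(r"\d??\d", "12").group()
forced_two = re.fullmatch(r"\d??\d", "12").group()

print(lazy)                    # /* first comment */
print(prefer_one, forced_two)  # 1 12
```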
864 nigel 75 .P
865 nigel 63 If the PCRE_UNGREEDY option is set (an option which is not available in Perl),
866     the quantifiers are not greedy by default, but individual ones can be made
867     greedy by following them with a question mark. In other words, it inverts the
868     default behaviour.
869 nigel 75 .P
870 nigel 63 When a parenthesized subpattern is quantified with a minimum repeat count that
871 nigel 75 is greater than 1 or with a limited maximum, more memory is required for the
872 nigel 63 compiled pattern, in proportion to the size of the minimum or maximum.
873 nigel 75 .P
874 nigel 63 If a pattern starts with .* or .{0,} and the PCRE_DOTALL option (equivalent
875     to Perl's /s) is set, thus allowing the . to match newlines, the pattern is
876     implicitly anchored, because whatever follows will be tried against every
877     character position in the subject string, so there is no point in retrying the
878     overall match at any position after the first. PCRE normally treats such a
879 nigel 75 pattern as though it were preceded by \eA.
880     .P
881 nigel 63 In cases where it is known that the subject string contains no newlines, it is
882     worth setting PCRE_DOTALL in order to obtain this optimization, or
883     alternatively using ^ to indicate anchoring explicitly.
884 nigel 75 .P
885 nigel 63 However, there is one situation where the optimization cannot be used. When .*
886     is inside capturing parentheses that are the subject of a backreference
887     elsewhere in the pattern, a match at the start may fail, and a later one
888     succeed. Consider, for example:
889 nigel 75 .sp
890     (.*)abc\e1
891     .sp
892 nigel 63 If the subject is "xyz123abc123" the match point is the fourth character. For
893     this reason, such a pattern is not implicitly anchored.
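The back reference case can be verified with a short sketch, again using Python's re module as an assumed stand-in for PCRE. Offsets are zero-based, so the fourth character is offset 3.

```python
import re

match = re.search(r"(.*)abc\1", "xyz123abc123")

# Attempts at offsets 0, 1 and 2 fail because the captured text does
# not reappear after "abc"; the attempt at offset 3 succeeds.
start_offset = match.start()
captured = match.group(1)

print(start_offset, captured)  # 3 123
```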
894 nigel 75 .P
895 nigel 63 When a capturing subpattern is repeated, the value captured is the substring
896     that matched the final iteration. For example, after
897 nigel 75 .sp
898     (tweedle[dume]{3}\es*)+
899     .sp
900 nigel 63 has matched "tweedledum tweedledee" the value of the captured substring is
901     "tweedledee". However, if there are nested capturing subpatterns, the
902     corresponding captured values may have been set in previous iterations. For
903     example, after
904 nigel 75 .sp
905 nigel 63 /(a|(b))+/
906 nigel 75 .sp
907 nigel 63 matches "aba" the value of the second captured substring is "b".
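Both capture rules can be observed in the sketch below. It relies on Python's re module, which is an assumption of this example; for these patterns it keeps captured values across iterations in the way described above.

```python
import re

# The outer group is overwritten on each iteration, so only the text
# from the final iteration remains.
repeated = re.match(r"(tweedle[dume]{3}\s*)+", "tweedledum tweedledee")
last_iteration = repeated.group(1)

# The inner group was set on the second iteration ("b") and is not
# cleared when the third iteration matches "a".
nested = re.match(r"(a|(b))+", "aba")
nested_groups = nested.groups()

print(last_iteration)  # tweedledee
print(nested_groups)   # ('a', 'b')
```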
908 nigel 75 .
909     .
910     .\" HTML <a name="atomicgroup"></a>
911     .SH "ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS"
912 nigel 63 .rs
913     .sp
914     With both maximizing and minimizing repetition, failure of what follows
915     normally causes the repeated item to be re-evaluated to see if a different
916     number of repeats allows the rest of the pattern to match. Sometimes it is
917     useful to prevent this, either to change the nature of the match, or to cause
918     it to fail earlier than it otherwise might, when the author of the pattern knows
919     there is no point in carrying on.
920 nigel 75 .P
921     Consider, for example, the pattern \ed+foo when applied to the subject line
922     .sp
923 nigel 63 123456bar
924 nigel 75 .sp
925 nigel 63 After matching all 6 digits and then failing to match "foo", the normal
926 nigel 75 action of the matcher is to try again with only 5 digits matching the \ed+
927 nigel 63 item, and then with 4, and so on, before ultimately failing. "Atomic grouping"
928     (a term taken from Jeffrey Friedl's book) provides the means for specifying
929     that once a subpattern has matched, it is not to be re-evaluated in this way.
930 nigel 75 .P
931 nigel 63 If we use atomic grouping for the previous example, the matcher would give up
932     immediately on failing to match "foo" the first time. The notation is a kind of
933     special parenthesis, starting with (?> as in this example:
934 nigel 75 .sp
935     (?>\ed+)foo
936     .sp
937 nigel 63 This kind of parenthesis "locks up" the part of the pattern it contains once
938     it has matched, and a failure further into the pattern is prevented from
939     backtracking into it. Backtracking past it to previous items, however, works as
940     normal.
941 nigel 75 .P
942 nigel 63 An alternative description is that a subpattern of this type matches the string
943     of characters that an identical standalone pattern would match, if anchored at
944     the current point in the subject string.
945 nigel 75 .P
946 nigel 63 Atomic grouping subpatterns are not capturing subpatterns. Simple cases such as
947     the above example can be thought of as a maximizing repeat that must swallow
948 nigel 75 everything it can. So, while both \ed+ and \ed+? are prepared to adjust the
949 nigel 63 number of digits they match in order to make the rest of the pattern match,
950 nigel 75 (?>\ed+) can only match an entire sequence of digits.
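The "locking up" effect can be sketched in Python, with one loud caveat: Python's re module accepts (?>...) only from version 3.11, so this example emulates an atomic group with a lookahead followed by a back reference. That idiom is a substitution made for this sketch, not the way PCRE implements atomic groups.

```python
import re

subject = "12345"

# Ordinary repeat: \d+ gives back the final digit so that "5" can match.
backtracking = re.search(r"\d+5", subject)

# Emulated atomic group: the lookahead captures the longest digit run
# and cannot be re-entered; \1 then consumes exactly that run, so no
# digit is left over for the trailing "5".
atomic_like = re.search(r"(?=(\d+))\1(?:5)", subject)

# When what follows does not compete for digits, the atomic form
# matches the whole run as usual.
whole_run = re.search(r"(?=(\d+))\1x", "12345x")

print(backtracking.group())  # 12345
print(atomic_like)           # None
print(whole_run.group())     # 12345x
```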
951     .P
952 nigel 63 Atomic groups in general can of course contain arbitrarily complicated
953     subpatterns, and can be nested. However, when the subpattern for an atomic
954     group is just a single repeated item, as in the example above, a simpler
955     notation, called a "possessive quantifier" can be used. This consists of an
956     additional + character following a quantifier. Using this notation, the
957     previous example can be rewritten as
958 nigel 75 .sp
959     \ed++foo
960     .sp
961 nigel 63 Possessive quantifiers are always greedy; the setting of the PCRE_UNGREEDY
962     option is ignored. They are a convenient notation for the simpler forms of
963     atomic group. However, there is no difference in the meaning or processing of a
964     possessive quantifier and the equivalent atomic group.
965 nigel 75 .P
966 nigel 63 The possessive quantifier syntax is an extension to the Perl syntax. It
967     originates in Sun's Java package.
968 nigel 75 .P
969 nigel 63 When a pattern contains an unlimited repeat inside a subpattern that can itself
970     be repeated an unlimited number of times, the use of an atomic group is the
971     only way to avoid some failing matches taking a very long time indeed. The
972     pattern
973 nigel 75 .sp
974     (\eD+|<\ed+>)*[!?]
975     .sp
976 nigel 63 matches an unlimited number of substrings that either consist of non-digits, or
977     digits enclosed in <>, followed by either ! or ?. When it matches, it runs
978     quickly. However, if it is applied to
979 nigel 75 .sp
980 nigel 63 aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
981 nigel 75 .sp
982 nigel 63 it takes a long time before reporting failure. This is because the string can
983 nigel 75 be divided between the internal \eD+ repeat and the external * repeat in a
984     large number of ways, and all have to be tried. (The example uses [!?] rather
985     than a single character at the end, because both PCRE and Perl have an
986     optimization that allows for fast failure when a single character is used. They
987     remember the last single character that is required for a match, and fail early
988     if it is not present in the string.) If the pattern is changed so that it uses
989     an atomic group, like this:
990     .sp
991     ((?>\eD+)|<\ed+>)*[!?]
992     .sp
993 nigel 63 sequences of non-digits cannot be broken, and failure happens quickly.
994 nigel 75 .
995     .
996     .\" HTML <a name="backreferences"></a>
997     .SH "BACK REFERENCES"
998 nigel 63 .rs
999     .sp
1000     Outside a character class, a backslash followed by a digit greater than 0 (and
1001     possibly further digits) is a back reference to a capturing subpattern earlier
1002     (that is, to its left) in the pattern, provided there have been that many
1003     previous capturing left parentheses.
1004 nigel 75 .P
1005 nigel 63 However, if the decimal number following the backslash is less than 10, it is
1006     always taken as a back reference, and causes an error only if there are not
1007     that many capturing left parentheses in the entire pattern. In other words, the
1008     parentheses that are referenced need not be to the left of the reference for
1009 nigel 75 numbers less than 10. See the subsection entitled "Non-printing characters"
1010     .\" HTML <a href="#digitsafterbackslash">
1011     .\" </a>
1012     above
1013     .\"
1014     for further details of the handling of digits following a backslash.
1015     .P
1016 nigel 63 A back reference matches whatever actually matched the capturing subpattern in
1017     the current subject string, rather than anything matching the subpattern
1018     itself (see
1019     .\" HTML <a href="#subpatternsassubroutines">
1020     .\" </a>
1021     "Subpatterns as subroutines"
1022     .\"
1023     below for a way of doing that). So the pattern
1024 nigel 75 .sp
1025     (sens|respons)e and \e1ibility
1026     .sp
1027 nigel 63 matches "sense and sensibility" and "response and responsibility", but not
1028     "sense and responsibility". If caseful matching is in force at the time of the
1029     back reference, the case of letters is relevant. For example,
1030 nigel 75 .sp
1031     ((?i)rah)\es+\e1
1032     .sp
1033 nigel 63 matches "rah rah" and "RAH RAH", but not "RAH rah", even though the original
1034     capturing subpattern is matched caselessly.
1035 nigel 75 .P
1036 nigel 63 Back references to named subpatterns use the Python syntax (?P=name). We could
1037     rewrite the above example as follows:
1038 nigel 75 .sp
1039     (?P<p1>(?i)rah)\es+(?P=p1)
1040     .sp
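Numbered and named back references can be compared with the following sketch. It uses Python's re module, the origin of the (?P=name) syntax, as an assumed stand-in for PCRE; the group name "stem" is hypothetical.

```python
import re

numbered = re.compile(r"(sens|respons)e and \1ibility")
named = re.compile(r"(?P<stem>sens|respons)e and (?P=stem)ibility")

phrases = [
    "sense and sensibility",
    "response and responsibility",
    "sense and responsibility",
]

# A back reference must repeat the text that was actually captured,
# so the mixed phrase is rejected by both patterns.
numbered_hits = [bool(numbered.fullmatch(p)) for p in phrases]
named_hits = [bool(named.fullmatch(p)) for p in phrases]

print(numbered_hits)  # [True, True, False]
print(named_hits)     # [True, True, False]
```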
1041 nigel 63 There may be more than one back reference to the same subpattern. If a
1042     subpattern has not actually been used in a particular match, any back
1043     references to it always fail. For example, the pattern
1044 nigel 75 .sp
1045     (a|(bc))\e2
1046     .sp
1047 nigel 63 always fails if it starts to match "a" rather than "bc". Because there may be
1048     many capturing parentheses in a pattern, all digits following the backslash are
1049     taken as part of a potential back reference number. If the pattern continues
1050     with a digit character, some delimiter must be used to terminate the back
1051     reference. If the PCRE_EXTENDED option is set, this can be whitespace.
1052 nigel 75 Otherwise an empty comment (see
1053     .\" HTML <a href="#comments">
1054     .\" </a>
1055     "Comments"
1056     .\"
1057     below) can be used.
1058     .P
1059 nigel 63 A back reference that occurs inside the parentheses to which it refers fails
1060 nigel 75 when the subpattern is first used, so, for example, (a\e1) never matches.
1061 nigel 63 However, such references can be useful inside repeated subpatterns. For
1062     example, the pattern
1063 nigel 75 .sp
1064     (a|b\e1)+
1065     .sp
1066 nigel 63 matches any number of "a"s and also "aba", "ababbaa" etc. At each iteration of
1067     the subpattern, the back reference matches the character string corresponding
1068     to the previous iteration. In order for this to work, the pattern must be such
1069     that the first iteration does not need to match the back reference. This can be
1070     done using alternation, as in the example above, or by a quantifier with a
1071     minimum of zero.
1072 nigel 75 .
1073     .
1074     .\" HTML <a name="bigassertions"></a>
1075 nigel 63 .SH ASSERTIONS
1076     .rs
1077     .sp
1078     An assertion is a test on the characters following or preceding the current
1079     matching point that does not actually consume any characters. The simple
1080 nigel 75 assertions coded as \eb, \eB, \eA, \eG, \eZ, \ez, ^ and $ are described
1081     .\" HTML <a href="#smallassertions">
1082     .\" </a>
1083     above.
1084     .\"
1085     .P
1086 nigel 63 More complicated assertions are coded as subpatterns. There are two kinds:
1087     those that look ahead of the current position in the subject string, and those
1088 nigel 75 that look behind it. An assertion subpattern is matched in the normal way,
1089     except that it does not cause the current matching position to be changed.
1090     .P
1091     Assertion subpatterns are not capturing subpatterns, and may not be repeated,
1092     because it makes no sense to assert the same thing several times. If any kind
1093     of assertion contains capturing subpatterns within it, these are counted for
1094     the purposes of numbering the capturing subpatterns in the whole pattern.
1095     However, substring capturing is carried out only for positive assertions,
1096     because it does not make sense for negative assertions.
1097     .
1098     .
1099     .SS "Lookahead assertions"
1100     .rs
1101     .sp
1102     Lookahead assertions start
1103 nigel 63 with (?= for positive assertions and (?! for negative assertions. For example,
1104 nigel 75 .sp
1105     \ew+(?=;)
1106     .sp
1107 nigel 63 matches a word followed by a semicolon, but does not include the semicolon in
1108     the match, and
1109 nigel 75 .sp
1110 nigel 63 foo(?!bar)
1111 nigel 75 .sp
1112 nigel 63 matches any occurrence of "foo" that is not followed by "bar". Note that the
1113     apparently similar pattern
1114 nigel 75 .sp
1115 nigel 63 (?!foo)bar
1116 nigel 75 .sp
1117 nigel 63 does not find an occurrence of "bar" that is preceded by something other than
1118     "foo"; it finds any occurrence of "bar" whatsoever, because the assertion
1119     (?!foo) is always true when the next three characters are "bar". A
1120 nigel 75 lookbehind assertion is needed to achieve the other effect.
1121     .P
1122 nigel 63 If you want to force a matching failure at some point in a pattern, the most
1123     convenient way to do it is with (?!) because an empty string always matches, so
1124     an assertion that requires there not to be an empty string must always fail.
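The lookahead examples above can be exercised with this sketch, which assumes Python's re module behaves like PCRE for these patterns. The subject strings are made up for illustration.

```python
import re

# Positive lookahead: the semicolon is required but not consumed.
word = re.search(r"\w+(?=;)", "total = count;").group()

# Negative lookahead: skip the "foo" that is followed by "bar".
foo_offset = re.search(r"foo(?!bar)", "foobar foobaz").start()

# (?!foo)bar still finds "bar" after "foo", because the assertion looks
# at the characters starting at "bar" itself.
bar_offset = re.search(r"(?!foo)bar", "foobar").start()

# (?!) can never succeed, so any match attempt through it fails.
forced_failure = re.search(r"x(?!)", "xxx")

print(word, foo_offset, bar_offset, forced_failure)  # count 7 3 None
```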
1125 nigel 75 .
1126     .
1127     .\" HTML <a name="lookbehind"></a>
1128     .SS "Lookbehind assertions"
1129     .rs
1130     .sp
1131 nigel 63 Lookbehind assertions start with (?<= for positive assertions and (?<! for
1132     negative assertions. For example,
1133 nigel 75 .sp
1134 nigel 63 (?<!foo)bar
1135 nigel 75 .sp
1136 nigel 63 does find an occurrence of "bar" that is not preceded by "foo". The contents of
1137     a lookbehind assertion are restricted such that all the strings it matches must
1138     have a fixed length. However, if there are several alternatives, they do not
1139     all have to have the same fixed length. Thus
1140 nigel 75 .sp
1141 nigel 63 (?<=bullock|donkey)
1142 nigel 75 .sp
1143 nigel 63 is permitted, but
1144 nigel 75 .sp
1145 nigel 63 (?<!dogs?|cats?)
1146 nigel 75 .sp
1147 nigel 63 causes an error at compile time. Branches that match different length strings
1148     are permitted only at the top level of a lookbehind assertion. This is an
1149     extension compared with Perl (at least for 5.8), which requires all branches to
1150     match the same length of string. An assertion such as
1151 nigel 75 .sp
1152 nigel 63 (?<=ab(c|de))
1153 nigel 75 .sp
1154 nigel 63 is not permitted, because its single top-level branch can match two different
1155     lengths, but it is acceptable if rewritten to use two top-level branches:
1156 nigel 75 .sp
1157 nigel 63 (?<=abc|abde)
1158 nigel 75 .sp
1159 nigel 63 The implementation of lookbehind assertions is, for each alternative, to
1160     temporarily move the current position back by the fixed width and then try to
1161     match. If there are insufficient characters before the current position, the
1162     match is deemed to fail.
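A minimal sketch of a negative lookbehind and of the fixed-length restriction follows. It uses Python's re module, which is an assumption of this example and is stricter than PCRE: Python also rejects top-level branches of different lengths, so only patterns that both engines treat alike are shown. The helper name compiles is hypothetical.

```python
import re

# Offsets are zero-based; the second "bar" starts at offset 10.
not_after_foo = re.compile(r"(?<!foo)bar")
first_hit = not_after_foo.search("foobar bazbar").start()
no_hit = not_after_foo.search("foobar")


def compiles(pattern):
    """Return True if the pattern is accepted by the regex compiler."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


# A fixed-length lookbehind is accepted; a branch such as "dogs?" that
# can match two different lengths is a compile-time error.
fixed_ok = compiles(r"(?<=abcd)")
variable_ok = compiles(r"(?<!dogs?|cats?)")

print(first_hit, no_hit, fixed_ok, variable_ok)  # 10 None True False
```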
1163 nigel 75 .P
1164     PCRE does not allow the \eC escape (which matches a single byte in UTF-8 mode)
1165 nigel 63 to appear in lookbehind assertions, because it makes it impossible to calculate
1166 nigel 75 the length of the lookbehind. The \eX escape, which can match different numbers
1167     of bytes, is also not permitted.
1168     .P
1169 nigel 63 Atomic groups can be used in conjunction with lookbehind assertions to specify
1170     efficient matching at the end of the subject string. Consider a simple pattern
1171     such as
1172 nigel 75 .sp
1173 nigel 63 abcd$
1174 nigel 75 .sp
1175 nigel 63 when applied to a long string that does not match. Because matching proceeds
1176     from left to right, PCRE will look for each "a" in the subject and then see if
1177     what follows matches the rest of the pattern. If the pattern is specified as
1178 nigel 75 .sp
1179 nigel 63 ^.*abcd$
1180 nigel 75 .sp
1181 nigel 63 the initial .* matches the entire string at first, but when this fails (because
1182     there is no following "a"), it backtracks to match all but the last character,
1183     then all but the last two characters, and so on. Once again the search for "a"
1184     covers the entire string, from right to left, so we are no better off. However,
1185     if the pattern is written as
1186 nigel 75 .sp
1187 nigel 63 ^(?>.*)(?<=abcd)
1188 nigel 75 .sp
1189     or, equivalently, using the possessive quantifier syntax,
1190     .sp
1191 nigel 63 ^.*+(?<=abcd)
1192 nigel 75 .sp
1193 nigel 63 there can be no backtracking for the .* item; it can match only the entire
1194     string. The subsequent lookbehind assertion does a single test on the last four
1195     characters. If it fails, the match fails immediately. For long strings, this
1196     approach makes a significant difference to the processing time.
1197 nigel 75 .
1198     .
1199     .SS "Using multiple assertions"
1200     .rs
1201     .sp
1202 nigel 63 Several assertions (of any sort) may occur in succession. For example,
1203 nigel 75 .sp
1204     (?<=\ed{3})(?<!999)foo
1205     .sp
1206 nigel 63 matches "foo" preceded by three digits that are not "999". Notice that each of
1207     the assertions is applied independently at the same point in the subject
1208     string. First there is a check that the previous three characters are all
1209     digits, and then there is a check that the same three characters are not "999".
1210 nigel 75 This pattern does \fInot\fP match "foo" preceded by six characters, the first
1211 nigel 63 three of which are digits and the last three of which are not "999". For example, it
1212     doesn't match "123abcfoo". A pattern to do that is
1213 nigel 75 .sp
1214     (?<=\ed{3}...)(?<!999)foo
1215     .sp
1216 nigel 63 This time the first assertion looks at the preceding six characters, checking
1217     that the first three are digits, and then the second assertion checks that the
1218     preceding three characters are not "999".
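The two patterns can be contrasted with the sketch below, which assumes Python's re module as a stand-in for PCRE. Both lookbehinds have a fixed length, so the patterns are accepted unchanged.

```python
import re

# Both assertions inspect the same three characters before "foo".
adjacent = re.compile(r"(?<=\d{3})(?<!999)foo")

# The first assertion now covers six characters: three digits, then
# any three characters; the second still checks the last three.
spaced = re.compile(r"(?<=\d{3}...)(?<!999)foo")

adjacent_hits = [bool(adjacent.search(s))
                 for s in ("123foo", "999foo", "123abcfoo")]
spaced_hits = [bool(spaced.search(s))
               for s in ("123abcfoo", "123999foo", "123foo")]

print(adjacent_hits)  # [True, False, False]
print(spaced_hits)    # [True, False, False]
```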
1219 nigel 75 .P
1220 nigel 63 Assertions can be nested in any combination. For example,
1221 nigel 75 .sp
1222 nigel 63 (?<=(?<!foo)bar)baz
1223 nigel 75 .sp
1224 nigel 63 matches an occurrence of "baz" that is preceded by "bar" which in turn is not
1225     preceded by "foo", while
1226 nigel 75 .sp
1227     (?<=\ed{3}(?!999)...)foo
1228     .sp
1229     is another pattern that matches "foo" preceded by three digits and any three
1230 nigel 63 characters that are not "999".
1231 nigel 75 .
1232     .
1233     .SH "CONDITIONAL SUBPATTERNS"
1234 nigel 63 .rs
1235     .sp
1236     It is possible to cause the matching process to obey a subpattern
1237     conditionally or to choose between two alternative subpatterns, depending on
1238     the result of an assertion, or whether a previous capturing subpattern matched
1239     or not. The two possible forms of conditional subpattern are
1240 nigel 75 .sp
1241 nigel 63 (?(condition)yes-pattern)
1242     (?(condition)yes-pattern|no-pattern)
1243 nigel 75 .sp
1244 nigel 63 If the condition is satisfied, the yes-pattern is used; otherwise the
1245     no-pattern (if present) is used. If there are more than two alternatives in the
1246     subpattern, a compile-time error occurs.
1247 nigel 75 .P
1248 nigel 63 There are three kinds of condition. If the text between the parentheses
1249     consists of a sequence of digits, the condition is satisfied if the capturing
1250     subpattern of that number has previously matched. The number must be greater
1251     than zero. Consider the following pattern, which contains non-significant white
1252     space to make it more readable (assume the PCRE_EXTENDED option) and to divide
1253     it into three parts for ease of discussion:
1254 nigel 75 .sp
1255     ( \e( )? [^()]+ (?(1) \e) )
1256     .sp
1257 nigel 63 The first part matches an optional opening parenthesis, and if that
1258     character is present, sets it as the first captured substring. The second part
1259     matches one or more characters that are not parentheses. The third part is a
1260     conditional subpattern that tests whether the first set of parentheses matched
1261     or not. If they did, that is, if the subject started with an opening parenthesis,
1262     the condition is true, and so the yes-pattern is executed and a closing
1263     parenthesis is required. Otherwise, since no-pattern is not present, the
1264     subpattern matches nothing. In other words, this pattern matches a sequence of
1265     non-parentheses, optionally enclosed in parentheses.
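.P
The behaviour described above can be checked with a minimal sketch. It assumes Python's standard re module as a stand-in for PCRE, with re.VERBOSE playing the role of PCRE_EXTENDED. The helper name and subject strings are hypothetical.

```python
import re

# Assumption: Python's re module stands in for PCRE, and re.VERBOSE plays
# the role of the PCRE_EXTENDED option.
optional_parens = re.compile(r"( \( )? [^()]+ (?(1) \) )", re.VERBOSE)


def whole_match(subject):
    """Return True if the entire subject is matched by the pattern."""
    return optional_parens.fullmatch(subject) is not None


for subject in ["abc", "(abc)", "(abc", "abc)"]:
    print(subject, whole_match(subject))
```

.P
The whole subject is matched only when the parentheses are both present or both absent.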
1266 nigel 75 .P
1267 nigel 63 If the condition is the string (R), it is satisfied if a recursive call to the
1268     pattern or subpattern has been made. At "top level", the condition is false.
1269     This is a PCRE extension. Recursive patterns are described in a later section.
1270 nigel 75 .P
1271 nigel 63 If the condition is not a sequence of digits or (R), it must be an assertion.
1272     This may be a positive or negative lookahead or lookbehind assertion. Consider
1273     this pattern, again containing non-significant white space, and with the two
1274     alternatives on the second line:
1275 nigel 75 .sp
1276 nigel 63 (?(?=[^a-z]*[a-z])
1277 nigel 75 \ed{2}-[a-z]{3}-\ed{2} | \ed{2}-\ed{2}-\ed{2} )
1278     .sp
1279 nigel 63 The condition is a positive lookahead assertion that matches an optional
1280     sequence of non-letters followed by a letter. In other words, it tests for the
1281     presence of at least one letter in the subject. If a letter is found, the
1282     subject is matched against the first alternative; otherwise it is matched
1283     against the second. This pattern matches strings in one of the two forms
1284     dd-aaa-dd or dd-dd-dd, where aaa are letters and dd are digits.
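.P
As a rough illustration of the assertion condition, the sketch below uses Python's standard re module, which is an assumption and not PCRE. That module does not accept an assertion as a condition, so the conditional (?(?=A)X|Y) is spelled out as the equivalent alternation (?:(?=A)X|(?!A)Y). The names and subject strings are invented.

```python
import re

# Assumption: Python's re module is used instead of PCRE. It has no assertion
# conditions, so (?(?=A)X|Y) is written out as (?:(?=A)X|(?!A)Y).
date_like = re.compile(
    r"""(?: (?=[^a-z]*[a-z]) \d{2}-[a-z]{3}-\d{2}
          | (?![^a-z]*[a-z]) \d{2}-\d{2}-\d{2} )""",
    re.VERBOSE,
)


def whole_match(subject):
    """Return True if the entire subject has one of the two forms."""
    return date_like.fullmatch(subject) is not None


for subject in ["12-abc-34", "12-34-56", "12-ab-34"]:
    print(subject, whole_match(subject))
```

.P
The first two subjects have the forms dd-aaa-dd and dd-dd-dd, while the third fits neither.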
1285 nigel 75 .
1286     .
1287     .\" HTML <a name="comments"></a>
1288 nigel 63 .SH COMMENTS
1289     .rs
1290     .sp
1291 nigel 75 The sequence (?# marks the start of a comment that continues up to the next
1292 nigel 63 closing parenthesis. Nested parentheses are not permitted. The characters
1293     that make up a comment play no part in the pattern matching at all.
1294 nigel 75 .P
1295 nigel 63 If the PCRE_EXTENDED option is set, an unescaped # character outside a
1296     character class introduces a comment that continues up to the next newline
1297     character in the pattern.
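.P
Both comment styles can be seen in the following sketch, which assumes Python's standard re module as a stand-in for PCRE and re.VERBOSE as the counterpart of PCRE_EXTENDED. The patterns themselves are invented for illustration.

```python
import re

# Assumption: Python's re module stands in for PCRE here.
# A (?#...) comment is skipped wherever it appears in the pattern.
inline_comment = re.compile(r"ab(?#this text is ignored)c")

# With re.VERBOSE (the counterpart of PCRE_EXTENDED), an unescaped "#"
# outside a character class starts a comment that runs to the newline.
extended_comment = re.compile(
    r"""
    \d{3}   # three digits
    [#]     # a literal "#", because it is inside a character class
    \d+     # one or more digits
    """,
    re.VERBOSE,
)

print(inline_comment.fullmatch("abc") is not None)
print(extended_comment.fullmatch("123#45") is not None)
```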
1298 nigel 75 .
1299     .
1300     .SH "RECURSIVE PATTERNS"
1301 nigel 63 .rs
1302     .sp
1303     Consider the problem of matching a string in parentheses, allowing for
1304     unlimited nested parentheses. Without the use of recursion, the best that can
1305     be done is to use a pattern that matches up to some fixed depth of nesting. It
1306 nigel 75 is not possible to handle an arbitrary nesting depth. Perl provides a facility
1307     that allows regular expressions to recurse (amongst other things). It does this
1308     by interpolating Perl code in the expression at run time, and the code can
1309     refer to the expression itself. A Perl pattern to solve the parentheses problem
1310     can be created like this:
1311     .sp
1312     $re = qr{\e( (?: (?>[^()]+) | (?p{$re}) )* \e)}x;
1313     .sp
1314 nigel 63 The (?p{...}) item interpolates Perl code at run time, and in this case refers
1315     recursively to the pattern in which it appears. Obviously, PCRE cannot support
1316     the interpolation of Perl code. Instead, it supports some special syntax for
1317     recursion of the entire pattern, and also for individual subpattern recursion.
1318 nigel 75 .P
1319 nigel 63 The special item that consists of (? followed by a number greater than zero and
1320     a closing parenthesis is a recursive call of the subpattern of the given
1321     number, provided that it occurs inside that subpattern. (If not, it is a
1322     "subroutine" call, which is described in the next section.) The special item
1323     (?R) is a recursive call of the entire regular expression.
1324 nigel 75 .P
1325 nigel 63 For example, this PCRE pattern solves the nested parentheses problem (assume
1326     the PCRE_EXTENDED option is set so that white space is ignored):
1327 nigel 75 .sp
1328     \e( ( (?>[^()]+) | (?R) )* \e)
1329     .sp
1330 nigel 63 First it matches an opening parenthesis. Then it matches any number of
1331     substrings which can either be a sequence of non-parentheses, or a recursive
1332     match of the pattern itself (that is, a correctly parenthesized substring).
1333     Finally there is a closing parenthesis.
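.P
The structure of this recursive pattern can be mirrored by a small hand-written function. This is a sketch in plain Python and not PCRE's implementation (Python's standard re module has no (?R) item). The function name match_nested is hypothetical, and it models a single match attempt anchored at a given position.

```python
def match_nested(subject, pos=0):
    """Mirror one anchored attempt of  \\( ( (?>[^()]+) | (?R) )* \\)  at pos.

    Returns the index just past the closing parenthesis, or -1 on failure.
    """
    if pos >= len(subject) or subject[pos] != "(":
        return -1                     # the leading \( did not match
    pos += 1
    while pos < len(subject):
        ch = subject[pos]
        if ch == ")":
            return pos + 1            # the final \) matched
        if ch == "(":
            pos = match_nested(subject, pos)   # plays the role of (?R)
            if pos == -1:
                return -1
        else:
            pos += 1                  # part of a run of non-parentheses
    return -1                         # ran out of subject before \)


print(match_nested("(ab(cd)ef)"))
print(match_nested("(ab(cd"))
```

.P
Each nested opening parenthesis causes one further call of the function, which is why no fixed nesting depth has to be chosen in advance.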
1334 nigel 75 .P
1335 nigel 63 If this were part of a larger pattern, you would not want to recurse the entire
1336     pattern, so instead you could use this:
1337 nigel 75 .sp
1338     ( \e( ( (?>[^()]+) | (?1) )* \e) )
1339     .sp
1340 nigel 63 We have put the pattern into parentheses, and caused the recursion to refer to
1341     them instead of the whole pattern. In a larger pattern, keeping track of
1342     parenthesis numbers can be tricky. It may be more convenient to use named
1343     parentheses instead. For this, PCRE uses (?P>name), which is an extension to
1344     the Python syntax that PCRE uses for named parentheses (Perl does not provide
1345     named parentheses). We could rewrite the above example as follows:
1346 nigel 75 .sp
1347     (?P<pn> \e( ( (?>[^()]+) | (?P>pn) )* \e) )
1348     .sp
1349 nigel 63 This particular example pattern contains nested unlimited repeats, and so the
1350     use of atomic grouping for matching strings of non-parentheses is important
1351     when applying the pattern to strings that do not match. For example, when this
1352     pattern is applied to
1353 nigel 75 .sp
1354 nigel 63 (aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()
1355 nigel 75 .sp
1356 nigel 63 it yields "no match" quickly. However, if atomic grouping is not used,
1357     the match runs for a very long time indeed because there are so many different
1358     ways the + and * repeats can carve up the subject, and all have to be tested
1359     before failure can be reported.
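.P
The growth in the number of ways to carve up a run of characters can be made concrete with a simplified count. The sketch below is a model only, under the assumption that each way of cutting a run of n characters into consecutive non-empty pieces is one candidate for a backtracking matcher; it does not measure what PCRE actually does. The function name is hypothetical.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def count_splits(n):
    """Count the ways to cut a run of n characters into non-empty pieces."""
    if n == 0:
        return 1                      # nothing left to cut: one way
    # Choose the length k of the first piece, then cut up the remainder.
    return sum(count_splits(n - k) for k in range(1, n + 1))


for n in (5, 10, 20):
    print(n, count_splits(n))
```

.P
The count doubles with every extra character, so in this model even a modest run of characters yields a very large number of candidates.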
1360 nigel 75 .P
1361 nigel 63 At the end of a match, the values set for any capturing subpatterns are those
1362     from the outermost level of the recursion at which the subpattern value is set.
1363     If you want to obtain intermediate values, a callout function can be used (see
1364 nigel 75 the section on callouts below and the
1365 nigel 63 .\" HREF
1366 nigel 75 \fBpcrecallout\fP
1367 nigel 63 .\"
1368     documentation). If the pattern above is matched against
1369 nigel 75 .sp
1370 nigel 63 (ab(cd)ef)
1371 nigel 75 .sp
1372 nigel 63 the value for the capturing parentheses is "ef", which is the last value taken
1373     on at the top level. If additional parentheses are added, giving
1374 nigel 75 .sp
1375     \e( ( ( (?>[^()]+) | (?R) )* ) \e)
1376 nigel 63    ^                        ^
1377        ^                        ^
1378 nigel 75 .sp
1379 nigel 63 the string they capture is "ab(cd)ef", the contents of the top level
1380     parentheses. If there are more than 15 capturing parentheses in a pattern, PCRE
1381     has to obtain extra memory to store data during a recursion, which it does by
1382 nigel 75 using \fBpcre_malloc\fP, freeing it via \fBpcre_free\fP afterwards. If no
1383 nigel 63 memory can be obtained, the match fails with the PCRE_ERROR_NOMEMORY error.
1384 nigel 75 .P
1385 nigel 63 Do not confuse the (?R) item with the condition (R), which tests for recursion.
1386     Consider this pattern, which matches text in angle brackets, allowing for
1387     arbitrary nesting. Only digits are allowed in nested brackets (that is, when
1388     recursing), whereas any characters are permitted at the outer level.
1389 nigel 75 .sp
1390     < (?: (?(R) \ed++ | [^<>]*+) | (?R)) * >
1391     .sp
1392 nigel 63 In this pattern, (?(R) is the start of a conditional subpattern, with two
1393     different alternatives for the recursive and non-recursive cases. The (?R) item
1394     is the actual recursive call.
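.P
The difference between the (R) condition and the (?R) call can be sketched with a hand-written matcher. This is plain Python and not PCRE (Python's standard re module supports neither item). The function name and its recursing flag are hypothetical, and the function models one match attempt anchored at a given position.

```python
def match_angle(subject, pos=0, recursing=False):
    """Mirror one anchored attempt of the angle-bracket pattern at pos.

    Returns the index just past the closing ">", or -1 on failure.
    """
    if pos >= len(subject) or subject[pos] != "<":
        return -1
    pos += 1
    while pos < len(subject):
        ch = subject[pos]
        if ch == ">":
            return pos + 1
        if ch == "<":
            # The (?R) item: the actual recursive call.
            pos = match_angle(subject, pos, recursing=True)
            if pos == -1:
                return -1
        elif recursing and ch not in "0123456789":
            return -1     # the (R) condition is true: only digits allowed
        else:
            pos += 1      # outer level: any character except < and >
    return -1


print(match_angle("<abc<123>def>"))
print(match_angle("<abc<12x>def>"))
```

.P
The flag corresponds to the (R) test, and the nested call corresponds to (?R).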
1395 nigel 75 .
1396     .
1397 nigel 63 .\" HTML <a name="subpatternsassubroutines"></a>
1398 nigel 75 .SH "SUBPATTERNS AS SUBROUTINES"
1399 nigel 63 .rs
1400     .sp
1401     If the syntax for a recursive subpattern reference (either by number or by
1402     name) is used outside the parentheses to which it refers, it operates like a
1403     subroutine in a programming language. An earlier example pointed out that the
1404     pattern
1405 nigel 75 .sp
1406     (sens|respons)e and \e1ibility
1407     .sp
1408 nigel 63 matches "sense and sensibility" and "response and responsibility", but not
1409     "sense and responsibility". If instead the pattern
1410 nigel 75 .sp
1411 nigel 63 (sens|respons)e and (?1)ibility
1412 nigel 75 .sp
1413 nigel 63 is used, it does match "sense and responsibility" as well as the other two
1414     strings. Such references must, however, follow the subpattern to which they
1415     refer.
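.P
The contrast between a back reference and a subroutine call can be imitated as follows. The sketch assumes Python's standard re module, which has no (?1) item, so the subroutine call is approximated by repeating the subpattern text when the pattern is built. The variable names are invented.

```python
import re

# Assumption: Python's re module is used instead of PCRE. It has no (?1)
# subroutine call, so the call is imitated by repeating the subpattern text.
stem = r"sens|respons"

# \1 must match the same text that group 1 actually matched.
backref = re.compile(rf"({stem})e and \1ibility")

# Repeating the subpattern lets either alternative match the second time.
subroutine_like = re.compile(rf"({stem})e and (?:{stem})ibility")

subject = "sense and responsibility"
print(backref.fullmatch(subject) is not None)
print(subroutine_like.fullmatch(subject) is not None)
```

.P
Only the second pattern should accept the mixed subject, because the repeated subpattern is matched afresh and is not compared with the earlier captured text.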
1416 nigel 75 .
1417     .
1418 nigel 63 .SH CALLOUTS
1419     .rs
1420     .sp
1421     Perl has a feature whereby using the sequence (?{...}) causes arbitrary Perl
1422     code to be obeyed in the middle of matching a regular expression. This makes it
1423     possible, amongst other things, to extract different substrings that match the
1424     same pair of parentheses when there is a repetition.
1425 nigel 75 .P
1426 nigel 63 PCRE provides a similar feature, but of course it cannot obey arbitrary Perl
1427     code. The feature is called "callout". The caller of PCRE provides an external
1428 nigel 75 function by putting its entry point in the global variable \fIpcre_callout\fP.
1429 nigel 63 By default, this variable contains NULL, which disables all calling out.
1430 nigel 75 .P
1431 nigel 63 Within a regular expression, (?C) indicates the points at which the external
1432     function is to be called. If you want to identify different callout points, you
1433     can put a number less than 256 after the letter C. The default value is zero.
1434     For example, this pattern has two callout points:
1435 nigel 75 .sp
1436 nigel 63 (?C1)\edabc(?C2)def
1437 nigel 75 .sp
1438     If the PCRE_AUTO_CALLOUT flag is passed to \fBpcre_compile()\fP, callouts are
1439     automatically installed before each item in the pattern. They are all numbered
1440     255.
1441     .P
1442     During matching, when PCRE reaches a callout point (and \fIpcre_callout\fP is
1443 nigel 63 set), the external function is called. It is provided with the number of the
1444 nigel 75 callout, the position in the pattern, and, optionally, one item of data
1445     originally supplied by the caller of \fBpcre_exec()\fP. The callout function
1446     may cause matching to proceed, to backtrack, or to fail altogether. A complete
1447     description of the interface to the callout function is given in the
1448 nigel 63 .\" HREF
1449 nigel 75 \fBpcrecallout\fP
1450 nigel 63 .\"
1451     documentation.
1452 nigel 75 .P
1453 nigel 63 .in 0
1454 nigel 75 Last updated: 09 September 2004
1455 nigel 63 .br
1456 nigel 75 Copyright (c) 1997-2004 University of Cambridge.
