Diff of /code/trunk/pcre.3

revision 3 by nigel, Sat Feb 24 21:38:01 2007 UTC revision 23 by nigel, Sat Feb 24 21:38:41 2007 UTC
# Line 8  pcre - Perl-compatible regular expressio Line 8  pcre - Perl-compatible regular expressio
8  .br  .br
9  .B pcre *pcre_compile(const char *\fIpattern\fR, int \fIoptions\fR,  .B pcre *pcre_compile(const char *\fIpattern\fR, int \fIoptions\fR,
10  .ti +5n  .ti +5n
11  .B char **\fIerrptr\fR, int *\fIerroffset\fR);  .B const char **\fIerrptr\fR, int *\fIerroffset\fR);
12  .PP  .PP
13  .br  .br
14  .B pcre_extra *pcre_study(const pcre *\fIcode\fR, int \fIoptions\fR,  .B pcre_extra *pcre_study(const pcre *\fIcode\fR, int \fIoptions\fR,
15  .ti +5n  .ti +5n
16  .B char **\fIerrptr\fR);  .B const char **\fIerrptr\fR);
17  .PP  .PP
18  .br  .br
19  .B int pcre_exec(const pcre *\fIcode\fR, "const pcre_extra *\fIextra\fR,"  .B int pcre_exec(const pcre *\fIcode\fR, "const pcre_extra *\fIextra\fR,"
# Line 52  pcre - Perl-compatible regular expressio Line 52  pcre - Perl-compatible regular expressio
52  .SH DESCRIPTION  .SH DESCRIPTION
53  The PCRE library is a set of functions that implement regular expression  The PCRE library is a set of functions that implement regular expression
54  pattern matching using the same syntax and semantics as Perl 5, with just a few  pattern matching using the same syntax and semantics as Perl 5, with just a few
55  differences (see below). The current implementation corresponds to Perl 5.004.  differences (see below). The current implementation corresponds to Perl 5.005.
56    
57  PCRE has its own native API, which is described in this man page. There is also  PCRE has its own native API, which is described in this man page. There is also
58  a set of wrapper functions that correspond to the POSIX API. See  a set of wrapper functions that correspond to the POSIX API. See
# Line 72  should be done before calling any PCRE f Line 72  should be done before calling any PCRE f
72    
73  The other global variables are character tables. They are initialized when PCRE  The other global variables are character tables. They are initialized when PCRE
74  is compiled, from source that is generated by reference to the C character type  is compiled, from source that is generated by reference to the C character type
75  functions, but which the maintainer of PCRE is free to modify. In principle  functions, but which a user of PCRE is free to modify. In principle the tables
76  they could also be modified at runtime. See PCRE's README file for more  could also be modified at run time. See PCRE's README file for more details.
 details.  
77    
78    
79  .SH MULTI-THREADING  .SH MULTI-THREADING
80  The PCRE functions can be used in multi-threading applications, with the  The PCRE functions can be used in multi-threading applications, with the
81  proviso that the character tables and the memory management functions pointed  proviso that the character tables and the memory management functions pointed
82  to by \fBpcre_malloc\fR and \fBpcre_free\fR will be shared by all threads.  to by \fBpcre_malloc\fR and \fBpcre_free\fR are shared by all threads.
83    
84  The compiled form of a regular expression is not altered during matching, so  The compiled form of a regular expression is not altered during matching, so
85  the same compiled pattern can safely be used by several threads at once.  the same compiled pattern can safely be used by several threads at once.
# Line 101  quantifiers with a minimum greater than Line 100  quantifiers with a minimum greater than
100  relevant portions of the compiled pattern to be replicated.  relevant portions of the compiled pattern to be replicated.
101  .PP  .PP
102  The \fIoptions\fR argument contains independent bits that affect the  The \fIoptions\fR argument contains independent bits that affect the
103  compilation. It should be zero if no options are required. Those options that  compilation. It should be zero if no options are required. Some of the options,
104  are compabible with Perl can also be set at compile time from within the  in particular, those that are compatible with Perl, can also be set and unset
105  pattern (see the detailed description of regular expressions below) and all  from within the pattern (see the detailed description of regular expressions
106  options except PCRE_EXTENDED and PCRE_EXTRA can be set at the time of matching.  below). For these options, the contents of the \fIoptions\fR argument specifies
107    their initial settings at the start of compilation and execution. The
108    PCRE_ANCHORED option can be set at the time of matching as well as at compile
109    time.
110  .PP  .PP
111  If \fIerrptr\fR is NULL, \fBpcre_compile()\fR returns NULL immediately.  If \fIerrptr\fR is NULL, \fBpcre_compile()\fR returns NULL immediately.
112  Otherwise, if compilation of a pattern fails, \fBpcre_compile()\fR returns  Otherwise, if compilation of a pattern fails, \fBpcre_compile()\fR returns
# Line 127  constructs in the pattern itself, which Line 129  constructs in the pattern itself, which
129    PCRE_CASELESS    PCRE_CASELESS
130    
131  If this bit is set, letters in the pattern match both upper and lower case  If this bit is set, letters in the pattern match both upper and lower case
132  letters in any subject string. It is equivalent to Perl's /i option.  letters. It is equivalent to Perl's /i option.
133    
134    PCRE_DOLLAR_ENDONLY    PCRE_DOLLAR_ENDONLY
135    
136  If this bit is set, a dollar metacharacter in the pattern matches only at the  If this bit is set, a dollar metacharacter in the pattern matches only at the
137  end of the subject string. By default, it also matches immediately before the  end of the subject string. Without this option, a dollar also matches
138  final character if it is a newline (but not before any other newlines). The  immediately before the final character if it is a newline (but not before any
139  PCRE_DOLLAR_ENDONLY option is ignored if PCRE_MULTILINE is set. There is no  other newlines). The PCRE_DOLLAR_ENDONLY option is ignored if PCRE_MULTILINE is
140  equivalent to this option in Perl.  set. There is no equivalent to this option in Perl.
141    
142    PCRE_DOTALL    PCRE_DOTALL
143    
144  If this bit is set, a dot metacharacter in the pattern matches all characters,  If this bit is set, a dot metacharacter in the pattern matches all characters,
145  including newlines. By default, newlines are excluded. This option is  including newlines. Without it, newlines are excluded. This option is
146  equivalent to Perl's /s option. A negative class such as [^a] always matches a  equivalent to Perl's /s option. A negative class such as [^a] always matches a
147  newline character, independent of the setting of this option.  newline character, independent of the setting of this option.
148    
149    PCRE_EXTENDED    PCRE_EXTENDED
150    
151  If this bit is set, whitespace characters in the pattern are totally ignored  If this bit is set, whitespace data characters in the pattern are totally
152  except when escaped or inside a character class, and characters between an  ignored except when escaped or inside a character class, and characters between
153  unescaped # outside a character class and the next newline character,  an unescaped # outside a character class and the next newline character,
154  inclusive, are also ignored. This is equivalent to Perl's /x option, and makes  inclusive, are also ignored. This is equivalent to Perl's /x option, and makes
155  it possible to include comments inside complicated patterns.  it possible to include comments inside complicated patterns. Note, however,
156    that this applies only to data characters. Whitespace characters may never
157    appear within special character sequences in a pattern, for example within the
158    sequence (?( which introduces a conditional subpattern.
159    
160      PCRE_EXTRA
161    
162    This option turns on additional functionality of PCRE that is incompatible with
163    Perl. Any backslash in a pattern that is followed by a letter that has no
164    special meaning causes an error, thus reserving these combinations for future
165    expansion. By default, as in Perl, a backslash followed by a letter with no
166    special meaning is treated as a literal. There are at present no other features
167    controlled by this option.
168    
169    PCRE_MULTILINE    PCRE_MULTILINE
170    
# Line 158  By default, PCRE treats the subject stri Line 172  By default, PCRE treats the subject stri
172  characters (even if it actually contains several newlines). The "start of line"  characters (even if it actually contains several newlines). The "start of line"
173  metacharacter (^) matches only at the start of the string, while the "end of  metacharacter (^) matches only at the start of the string, while the "end of
174  line" metacharacter ($) matches only at the end of the string, or before a  line" metacharacter ($) matches only at the end of the string, or before a
175  terminating newline. This is the same as Perl.  terminating newline (unless PCRE_DOLLAR_ENDONLY is set). This is the same as
176    Perl.
177    
178  When PCRE_MULTILINE is set, the "start of line" and "end of line" constructs  When PCRE_MULTILINE is set, the "start of line" and "end of line" constructs
179  match immediately following or immediately before any newline in the subject  match immediately following or immediately before any newline in the subject
# Line 167  to Perl's /m option. If there are no "\\ Line 182  to Perl's /m option. If there are no "\\
182  no occurrences of ^ or $ in a pattern, setting PCRE_MULTILINE has no  no occurrences of ^ or $ in a pattern, setting PCRE_MULTILINE has no
183  effect.  effect.
184    
185    PCRE_EXTRA    PCRE_UNGREEDY
   
 This option turns on additional functionality of PCRE that is incompatible with  
 Perl. Any backslash in a pattern that is followed by a letter that has no  
 special meaning causes an error, thus reserving these combinations for future  
 expansion. By default, as in Perl, a backslash followed by a letter with no  
 special meaning is treated as a literal. There are two extra features currently  
 provided, and both are in some sense experimental additions that are useful for  
 influencing the progress of a match.  
   
   (1) The sequence \\X inserts a Prolog-like "cut" into the expression.  
   
   (2) Once a subpattern enclosed in (?>subpat) brackets has matched,  
       backtracking never goes back into the pattern.  
   
 See below for further details of both of these.  
186    
187    This option inverts the "greediness" of the quantifiers so that they are not
188    greedy by default, but become greedy if followed by "?". It is not compatible
189    with Perl. It can also be set by a (?U) option setting within the pattern.
190    
191    
192  .SH STUDYING A PATTERN  .SH STUDYING A PATTERN
# Line 195  typedef) containing additional informati Line 198  typedef) containing additional informati
198  passed to \fBpcre_exec()\fR. If no additional information is available, NULL  passed to \fBpcre_exec()\fR. If no additional information is available, NULL
199  is returned.  is returned.
200    
201  The second argument contains option bits. The only one currently supported is  The second argument contains option bits. At present, no options are defined
202  PCRE_CASELESS. It forces the studying to be done in a caseless manner, even if  for \fBpcre_study()\fR, and this argument should always be zero.
 the original pattern was compiled without PCRE_CASELESS. When the result of  
 \fBpcre_study()\fR is passed to \fBpcre_exec()\fR, it is used only if its  
 caseless state is the same as that of the matching process. A pattern that is  
 compiled without PCRE_CASELESS can be studied with and without PCRE_CASELESS,  
 and the appropriate data passed to \fBpcre_exec()\fR with and without the  
 PCRE_CASELESS flag.  
203    
204  The third argument for \fBpcre_study()\fR is a pointer to an error message. If  The third argument for \fBpcre_study()\fR is a pointer to an error message. If
205  studying succeeds (even if no data is returned), the variable it points to is  studying succeeds (even if no data is returned), the variable it points to is
# Line 222  pattern has been studied, the result of Line 219  pattern has been studied, the result of
219  The subject string is passed as a pointer in \fIsubject\fR and a length in  The subject string is passed as a pointer in \fIsubject\fR and a length in
220  \fIlength\fR. Unlike the pattern string, it may contain binary zero characters.  \fIlength\fR. Unlike the pattern string, it may contain binary zero characters.
221    
222  The options PCRE_ANCHORED, PCRE_CASELESS, PCRE_DOLLAR_ENDONLY, PCRE_DOTALL, and  The PCRE_ANCHORED option can be passed in the \fIoptions\fR argument, whose
223  PCRE_MULTILINE can be passed in the \fIoptions\fR argument, whose unused bits  unused bits must be zero. However, if a pattern was compiled with
224  must be zero. However, if a pattern is compiled with any of these options, they  PCRE_ANCHORED, or turned out to be anchored by virtue of its contents, it
225  cannot be unset when it is obeyed.  cannot be made unanchored at matching time.
226    
227  There are also two further options that can be set only at matching time:  There are also two further options that can be set only at matching time:
228    
# Line 233  There are also two further options that Line 230  There are also two further options that
230    
231  The first character of the string is not the beginning of a line, so the  The first character of the string is not the beginning of a line, so the
232  circumflex metacharacter should not match before it. Setting this without  circumflex metacharacter should not match before it. Setting this without
233  PCRE_MULTILINE (at either compile or match time) causes circumflex never to  PCRE_MULTILINE (at compile time) causes circumflex never to match.
 match.  
234    
235    PCRE_NOTEOL    PCRE_NOTEOL
236    
237  The end of the string is not the end of a line, so the dollar metacharacter  The end of the string is not the end of a line, so the dollar metacharacter
238  should not match it. Setting this without PCRE_MULTILINE (at either compile or  should not match it nor (except in multiline mode) a newline immediately before
239  match time) causes dollar never to match.  it. Setting this without PCRE_MULTILINE (at compile time) causes dollar never
240    to match.
241    
242  In general, a pattern matches a certain portion of the subject, and in  In general, a pattern matches a certain portion of the subject, and in
243  addition, further substrings from the subject may be picked out by parts of the  addition, further substrings from the subject may be picked out by parts of the
# Line 251  kinds of parenthesized subpattern that d Line 248  kinds of parenthesized subpattern that d
248    
249  Captured substrings are returned to the caller via a vector of integer offsets  Captured substrings are returned to the caller via a vector of integer offsets
250  whose address is passed in \fIovector\fR. The number of elements in the vector  whose address is passed in \fIovector\fR. The number of elements in the vector
251  is passed in \fIovecsize\fR. This should always be an even number, because the  is passed in \fIovecsize\fR. The first two-thirds of the vector is used to pass
252  elements are used in pairs. If an odd number is passed, it is rounded down.  back captured substrings, each substring using a pair of integers. The
253    remaining third of the vector is used as workspace by \fBpcre_exec()\fR while
254  The first element of a pair is set to the offset of the first character in a  matching capturing subpatterns, and is not available for passing back
255  substring, and the second is set to the offset of the first character after the  information. The length passed in \fIovecsize\fR should always be a multiple of
256  end of a substring. The first pair, \fIovector[0]\fR and \fIovector[1]\fR,  three. If it is not, it is rounded down.
257  identify the portion of the subject string matched by the entire pattern. The  
258  next pair is used for the first capturing subpattern, and so on. The value  When a match has been successful, information about captured substrings is
259  returned by \fBpcre_exec()\fR is the number of pairs that have been set. If  returned in pairs of integers, starting at the beginning of \fIovector\fR, and
260  there are no capturing subpatterns, the return value from a successful match  continuing up to two-thirds of its length at the most. The first element of a
261  is 1, indicating that just the first pair of offsets has been set.  pair is set to the offset of the first character in a substring, and the second
262    is set to the offset of the first character after the end of a substring. The
263    first pair, \fIovector[0]\fR and \fIovector[1]\fR, identify the portion of the
264    subject string matched by the entire pattern. The next pair is used for the
265    first capturing subpattern, and so on. The value returned by \fBpcre_exec()\fR
266    is the number of pairs that have been set. If there are no capturing
267    subpatterns, the return value from a successful match is 1, indicating that
268    just the first pair of offsets has been set.
269    
270  It is possible for a capturing subpattern number \fIn+1\fR to match some  It is possible for a capturing subpattern number \fIn+1\fR to match some
271  part of the subject when subpattern \fIn\fR has not been used at all. For  part of the subject when subpattern \fIn\fR has not been used at all. For
272  example, if the string "abc" is matched against the pattern "(a|(z))(bc)",  example, if the string "abc" is matched against the pattern (a|(z))(bc)
273  subpatterns 1 and 3 are matched, but 2 is not. When this happens, both offset  subpatterns 1 and 3 are matched, but 2 is not. When this happens, both offset
274  values corresponding to the unused subpattern are set to -1.  values corresponding to the unused subpattern are set to -1.
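
The offset-pair layout described above can be illustrated with a small helper. This is only a sketch: copy_capture() is a hypothetical name, not part of the PCRE API, and it assumes the caller already holds the ovector contents and the return value of pcre_exec(). It treats a pair of -1 offsets as an unset subpattern.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of PCRE): copy captured substring number n
   out of subject into buf, using the offset pairs stored in ovector.
   rc is the value returned by pcre_exec(), i.e. the number of pairs set.
   Returns the substring length, or -1 if the subpattern is unset, was not
   reported, or does not fit in buf (including the terminating zero). */
int copy_capture(const char *subject, const int *ovector, int rc,
                 int n, char *buf, size_t buflen)
{
    int start, end;
    size_t len;

    if (n < 0 || n >= rc)
        return -1;                 /* pair n was not set by the match */
    start = ovector[2 * n];        /* offset of first character */
    end = ovector[2 * n + 1];      /* offset just past the last character */
    if (start < 0 || end < start)
        return -1;                 /* unused subpattern: offsets are -1 */
    len = (size_t)(end - start);
    if (len + 1 > buflen)
        return -1;
    memcpy(buf, subject + start, len);
    buf[len] = '\0';
    return (int)len;
}
```

For the example in the text ("abc" against (a|(z))(bc)), the pairs would be {0,3}, {0,1}, {-1,-1}, {1,3}, so the helper reports subpattern 2 as unset. Because a subject may contain binary zeros, callers should rely on the returned length rather than on the terminating zero.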
275    
# Line 273  If a capturing subpattern is matched rep Line 277  If a capturing subpattern is matched rep
277  string that it matched that gets returned.  string that it matched that gets returned.
278    
279  If the vector is too small to hold all the captured substrings, it is used as  If the vector is too small to hold all the captured substrings, it is used as
280  far as possible, and the function returns a value of zero. In particular, if  far as possible (up to two-thirds of its length), and the function returns a
281  the substring offsets are not of interest, \fBpcre_exec()\fR may be called with  value of zero. In particular, if the substring offsets are not of interest,
282  \fIovector\fR passed as NULL and \fIovecsize\fR as zero. However, if the  \fBpcre_exec()\fR may be called with \fIovector\fR passed as NULL and
283  pattern contains back references and the \fIovector\fR isn't big enough to  \fIovecsize\fR as zero. However, if the pattern contains back references and
284  remember the related substrings, PCRE has to get additional memory for use  the \fIovector\fR isn't big enough to remember the related substrings, PCRE has
285  during matching. Thus it is usually advisable to supply an \fIovector\fR.  to get additional memory for use during matching. Thus it is usually advisable
286    to supply an \fIovector\fR.
287    
288  Note that \fBpcre_info()\fR can be used to find out how many capturing  Note that \fBpcre_info()\fR can be used to find out how many capturing
289  subpatterns there are in a compiled pattern.  subpatterns there are in a compiled pattern. The smallest size for
290    \fIovector\fR that will allow for \fIn\fR captured substrings in addition to
291    the offsets of the substring matched by the whole pattern is (\fIn\fR+1)*3.
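
The sizing rule in the revised text can be written out as arithmetic. The two function names below are hypothetical and exist only for illustration; they assume the revision 23 behaviour, in which the last third of the vector is workspace and a length that is not a multiple of three is rounded down.

```c
/* Hypothetical helpers (not part of PCRE) for the ovector sizing rule. */

/* Smallest ovector length that allows for n captured substrings in
   addition to the pair for the whole match: (n + 1) * 3. */
int ovector_min_size(int n)
{
    return (n + 1) * 3;
}

/* Number of offset pairs that can be passed back for a given ovecsize.
   The size is rounded down to a multiple of three, two-thirds of that
   holds offsets, and each substring uses two integers, which reduces
   to ovecsize / 3 with integer division. */
int ovector_usable_pairs(int ovecsize)
{
    if (ovecsize < 0)
        return 0;
    return ovecsize / 3;
}
```

A pattern with five capturing subpatterns would therefore need a vector of at least 18 integers, of which 12 carry the six offset pairs.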
292    
293  If \fBpcre_exec()\fR fails, it returns a negative number. The following are  If \fBpcre_exec()\fR fails, it returns a negative number. The following are
294  defined in the header file:  defined in the header file:
# Line 290  defined in the header file: Line 297  defined in the header file:
297    
298  The subject string did not match the pattern.  The subject string did not match the pattern.
299    
300    PCRE_ERROR_BADREF         (-2)    PCRE_ERROR_NULL           (-2)
   
 There was a back-reference in the pattern to a capturing subpattern that had  
 not previously been set.  
   
   PCRE_ERROR_NULL           (-3)  
301    
302  Either \fIcode\fR or \fIsubject\fR was passed as NULL, or \fIovector\fR was  Either \fIcode\fR or \fIsubject\fR was passed as NULL, or \fIovector\fR was
303  NULL and \fIovecsize\fR was not zero.  NULL and \fIovecsize\fR was not zero.
304    
305    PCRE_ERROR_BADOPTION      (-4)    PCRE_ERROR_BADOPTION      (-3)
306    
307  An unrecognized bit was set in the \fIoptions\fR argument.  An unrecognized bit was set in the \fIoptions\fR argument.
308    
309    PCRE_ERROR_BADMAGIC       (-5)    PCRE_ERROR_BADMAGIC       (-4)
310    
311  PCRE stores a 4-byte "magic number" at the start of the compiled code, to catch  PCRE stores a 4-byte "magic number" at the start of the compiled code, to catch
312  the case when it is passed a junk pointer. This is the error it gives when the  the case when it is passed a junk pointer. This is the error it gives when the
313  magic number isn't present.  magic number isn't present.
314    
315    PCRE_ERROR_UNKNOWN_NODE   (-6)    PCRE_ERROR_UNKNOWN_NODE   (-5)
316    
317  While running the pattern match, an unknown item was encountered in the  While running the pattern match, an unknown item was encountered in the
318  compiled pattern. This error could be caused by a bug in PCRE or by overwriting  compiled pattern. This error could be caused by a bug in PCRE or by overwriting
319  of the compiled pattern.  of the compiled pattern.
320    
321    PCRE_ERROR_NOMEMORY       (-7)    PCRE_ERROR_NOMEMORY       (-6)
322    
323  If a pattern contains back references, but the \fIovector\fR that is passed to  If a pattern contains back references, but the \fIovector\fR that is passed to
324  \fBpcre_exec()\fR is not big enough to remember the referenced substrings, PCRE  \fBpcre_exec()\fR is not big enough to remember the referenced substrings, PCRE
# Line 353  The maximum length of a compiled pattern Line 355  The maximum length of a compiled pattern
355  All values in repeating quantifiers must be less than 65536.  All values in repeating quantifiers must be less than 65536.
356  The maximum number of capturing subpatterns is 99.  The maximum number of capturing subpatterns is 99.
357  The maximum number of all parenthesized subpatterns, including capturing  The maximum number of all parenthesized subpatterns, including capturing
358  subpatterns and assertions, is 200.  subpatterns, assertions, and other types of subpattern, is 200.
359    
360  The maximum length of a subject string is the largest positive number that an  The maximum length of a subject string is the largest positive number that an
361  integer variable can hold. However, PCRE uses recursion to handle subpatterns  integer variable can hold. However, PCRE uses recursion to handle subpatterns
# Line 362  the size of a subject string that can be Line 364  the size of a subject string that can be
364    
365    
366  .SH DIFFERENCES FROM PERL  .SH DIFFERENCES FROM PERL
367  The differences described here are with respect to Perl 5.004.  The differences described here are with respect to Perl 5.005.
368    
369  1. By default, a whitespace character is any character that the C library  1. By default, a whitespace character is any character that the C library
370  function \fBisspace()\fR recognizes, though it is possible to compile PCRE with  function \fBisspace()\fR recognizes, though it is possible to compile PCRE with
# Line 371  formfeed, newline, carriage return, hori Line 373  formfeed, newline, carriage return, hori
373  no longer includes vertical tab in its set of whitespace characters. The \\v  no longer includes vertical tab in its set of whitespace characters. The \\v
374  escape that was in the Perl documentation for a long time was never in fact  escape that was in the Perl documentation for a long time was never in fact
375  recognized. However, the character itself was treated as whitespace at least  recognized. However, the character itself was treated as whitespace at least
376  up to 5.002. In 5.004 it does not match \\s.  up to 5.002. In 5.004 and 5.005 it does not match \\s.
377    
378  2. PCRE does not allow repeat quantifiers on lookahead assertions. Perl permits  2. PCRE does not allow repeat quantifiers on lookahead assertions. Perl permits
379  them, but they do not mean what you might think. For example, "(?!a){3}" does  them, but they do not mean what you might think. For example, (?!a){3} does
380  not assert that the next three characters are not "a". It just asserts that the  not assert that the next three characters are not "a". It just asserts that the
381  next character is not "a" three times.  next character is not "a" three times.
382    
# Line 396  are not part of its pattern matching eng Line 398  are not part of its pattern matching eng
398  6. The Perl \\G assertion is not supported as it is not relevant to single  6. The Perl \\G assertion is not supported as it is not relevant to single
399  pattern matches.  pattern matches.
400    
401  7. If a backreference can never be matched, PCRE diagnoses an error. In a case  7. Fairly obviously, PCRE does not support the (?{code}) construction.
 like  
402    
403    /(123)\\2/  8. There are at the time of writing some oddities in Perl 5.005_02 concerned
404    with the settings of captured strings when part of a pattern is repeated. For
405    example, matching "aba" against the pattern /^(a(b)?)+$/ sets $2 to the value
406    "b", but matching "aabbaa" against /^(aa(bb)?)+$/ leaves $2 unset. However, if
407    the pattern is changed to /^(aa(b(b))?)+$/ then $2 (and $3) get set.
408    
409  the error occurs at compile time. Perl gives no compile time error; version  In Perl 5.004 $2 is set in both cases, and that is also true of PCRE. If in the
410  5.004 either always fails to match, or gives a segmentation fault at runtime.  future Perl changes to a consistent state that is different, PCRE may change to
411  In more complicated cases such as  follow.
412    
413    /(1)(2)(3)(4)(5)(6)(7)(8)(9)(10\\10)/  9. Another as yet unresolved discrepancy is that in Perl 5.005_02 the pattern
414    /^(a)?(?(1)a|b)+$/ matches the string "a", whereas in PCRE it does not.
415    However, in both Perl and PCRE /^(a)?a/ matched against "a" leaves $1 unset.
416    
417  PCRE returns PCRE_ERROR_BADREF at run time. Perl always fails to match.  10. PCRE provides some extensions to the Perl regular expression facilities:
418    
419  8. PCRE provides some extensions to the Perl regular expression facilities:  (a) Although lookbehind assertions must match fixed length strings, each
420    alternative branch of a lookbehind assertion can match a different length of
421    string. Perl 5.005 requires them all to have the same length.
422    
423  (a) If PCRE_DOLLAR_ENDONLY is set and PCRE_MULTILINE is not set, the $ meta-  (b) If PCRE_DOLLAR_ENDONLY is set and PCRE_MULTILINE is not set, the $ meta-
424  character matches only at the very end of the string.  character matches only at the very end of the string.
425    
426  (b) If PCRE_EXTRA is set, the \\X assertion (a Prolog-like "cut") is  (c) If PCRE_EXTRA is set, a backslash followed by a letter with no special
427  recognized, and a backslash followed by a letter with no special meaning is  meaning is faulted.
428  faulted. There is also a new kind of parenthesized subpattern starting with (?>  
429  which has a block on backtracking into it once it has matched.  (d) If PCRE_UNGREEDY is set, the greediness of the repetition quantifiers is
430    inverted, that is, by default they are not greedy, but if followed by a
431    question mark they are.
432    
433    
434  .SH REGULAR EXPRESSION DETAILS  .SH REGULAR EXPRESSION DETAILS
# Line 484  non-alphameric with "\\" to specify that Line 495  non-alphameric with "\\" to specify that
495  if you want to match a backslash, you write "\\\\".  if you want to match a backslash, you write "\\\\".
496    
497  If a pattern is compiled with the PCRE_EXTENDED option, whitespace in the  If a pattern is compiled with the PCRE_EXTENDED option, whitespace in the
498  pattern and characters between a "#" outside a character class and the next  pattern (other than in a character class) and characters between a "#" outside
499  newline character are ignored. An escaping backslash can be used to include a  a character class and the next newline character are ignored. An escaping
500  whitespace or "#" character as part of the pattern.  backslash can be used to include a whitespace or "#" character as part of the
501    pattern.
502    
503  A second use of backslash provides a way of encoding non-printing characters  A second use of backslash provides a way of encoding non-printing characters
504  in patterns in a visible manner. There is no restriction on the appearance of  in patterns in a visible manner. There is no restriction on the appearance of
# Line 503  represents: Line 515  represents:
515    \\r     carriage return (hex 0D)    \\r     carriage return (hex 0D)
516    \\t     tab (hex 09)    \\t     tab (hex 09)
517    \\xhh   character with hex code hh    \\xhh   character with hex code hh
518    \\ddd   character with octal code ddd or backreference    \\ddd   character with octal code ddd, or backreference
519    
520  The precise effect of "\\cx" is as follows: if "x" is a lower case letter, it  The precise effect of "\\cx" is as follows: if "x" is a lower case letter, it
521  is converted to upper case. Then bit 6 of the character (hex 40) is inverted.  is converted to upper case. Then bit 6 of the character (hex 40) is inverted.
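
The "\\cx" rule stated above can be sketched in a few lines of C. The function name ctrl_escape() is hypothetical, and the sketch assumes an ASCII character set and the default C locale; it is not taken from the PCRE source.

```c
#include <ctype.h>

/* Hypothetical illustration of the "\cx" rule (assumes ASCII):
   a lower case letter is first converted to upper case, then
   bit 6 of the character (hex 40) is inverted. */
int ctrl_escape(int x)
{
    if (islower((unsigned char)x))
        x = toupper((unsigned char)x);
    return x ^ 0x40;
}
```

Under these assumptions ctrl_escape('z') and ctrl_escape('Z') both give hex 1A, while a non-letter such as ';' (hex 3B) simply has bit 6 flipped and gives hex 7B.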
# Line 516  lower case). Line 528  lower case).
528  After "\\0" up to two further octal digits are read. In both cases, if there  After "\\0" up to two further octal digits are read. In both cases, if there
529  are fewer than two digits, just those that are present are used. Thus the  are fewer than two digits, just those that are present are used. Thus the
530  sequence "\\0\\x\\07" specifies two binary zeros followed by a BEL character.  sequence "\\0\\x\\07" specifies two binary zeros followed by a BEL character.
531  Make sure you supply two digits if the character that follows could otherwise  Make sure you supply two digits after the initial zero if the character that
532  be taken as another digit.  follows is itself an octal digit.
533    
534  The handling of a backslash followed by a digit other than 0 is complicated.  The handling of a backslash followed by a digit other than 0 is complicated.
535  Outside a character class, PCRE reads it and any following digits as a decimal  Outside a character class, PCRE reads it and any following digits as a decimal
# Line 573  one character of the appropriate type. I Line 585  one character of the appropriate type. I
585  end of the subject string, all of them fail, since there is no character to  end of the subject string, all of them fail, since there is no character to
586  match.  match.
587    
588  The fourth use of backslash is for certain assertions. An assertion specifies a  The fourth use of backslash is for certain simple assertions. An assertion
589  condition that has to be met at a particular point in a match, without  specifies a condition that has to be met at a particular point in a match,
590  consuming any characters from the subject string. The backslashed assertions  without consuming any characters from the subject string. The use of
591  are  subpatterns for more complicated assertions is described below. The backslashed
592    assertions are
593    
594    \\b     word boundary    \\b     word boundary
595    \\B     not a word boundary    \\B     not a word boundary
596    \\A     start of subject (independent of multiline mode)    \\A     start of subject (independent of multiline mode)
597    \\Z     end of subject (independent of multiline mode)    \\Z     end of subject or newline at end (independent of multiline mode)
598      \\z     end of subject (independent of multiline mode)
599    
600  Assertions may not appear in character classes (but note that "\\b" has a  These assertions may not appear in character classes (but note that "\\b" has a
601  different meaning, namely the backspace character, inside a character class).  different meaning, namely the backspace character, inside a character class).
602    
603  A word boundary is a position in the subject string where the current character  A word boundary is a position in the subject string where the current character
604  and the previous character do not both match "\\w" or "\\W" (i.e. one matches  and the previous character do not both match \\w or \\W (i.e. one matches
605  "\\w" and the other matches "\\W"), or the start or end of the string if the  \\w and the other matches \\W), or the start or end of the string if the
606  first or last character matches "\\w", respectively. More complicated  first or last character matches \\w, respectively.
607  assertions are also supported (see below).  
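The word boundary definition can be expressed as a small C test. This is a sketch under assumptions: the names is_word_char and is_word_boundary are hypothetical, and a "word" character is taken to be a letter, a digit or underscore in the default C locale.

```c
#include <ctype.h>
#include <stddef.h>

/* Assumed meaning of \w: a letter, a digit or underscore. */
static int is_word_char(unsigned char c)
{
    return isalnum(c) || c == '_';
}

/* Hypothetical helper, not part of PCRE: non-zero if position pos
   (0 <= pos <= len) in subject is a word boundary as defined above.
   Positions before the start and after the end count as non-word. */
static int is_word_boundary(const char *subject, size_t len, size_t pos)
{
    int prev = pos > 0   && is_word_char((unsigned char)subject[pos - 1]);
    int cur  = pos < len && is_word_char((unsigned char)subject[pos]);
    return prev != cur;  /* exactly one side is a word character */
}
```

For the subject "ab cd" this reports boundaries at positions 0, 2, 3 and 5, and none at position 1.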
608    The \\A, \\Z, and \\z assertions differ from the traditional circumflex and
609  The "\\A" and "\\Z" assertions differ from the traditional "^" and "$"  dollar (described below) in that they only ever match at the very start and end
610  (described below) in that they only ever match at the very start and end of the  of the subject string, whatever options are set. They are not affected by the
611  subject string, respectively, whatever options are set.  PCRE_NOTBOL or PCRE_NOTEOL options. The difference between \\Z and \\z is that
612    \\Z matches before a newline that is the last character of the string as well
613  When the PCRE_EXTRA flag is set on a call to \fBpcre_compile()\fR, the  as at the end of the string, whereas \\z matches only at the end.
 additional assertion \\X, which has no equivalent in Perl, is recognized.  
 This operates like the "cut" operation in Prolog: it prevents the matching  
 operation from backtracking past it. For example, if the expression  
   
   .*/foo  
   
 is matched against the string "/foo/this/is/not" then after the initial greedy  
 .* has swallowed the whole string, it keeps backtracking right the way to the  
 beginning before failing. If, on the other hand, the expression is  
   
   .*/\\Xfoo  
   
 then once it has discovered that "/not" is not "/foo", backtracking ceases, and  
 the match fails. See also the section on "once-only" subpatterns below.  
   
614    
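The difference between \\A, \\Z and \\z described above can be sketched as position tests in C. The function names at_A, at_Z and at_z are hypothetical and are not part of the PCRE API.

```c
#include <stddef.h>

/* Hypothetical helpers, not part of PCRE: position tests for a subject of
   length len, where pos is the current matching position. */
static int at_A(size_t pos)
{
    return pos == 0;                 /* very start of the subject only */
}

static int at_z(size_t len, size_t pos)
{
    return pos == len;               /* very end of the subject only */
}

static int at_Z(const char *subject, size_t len, size_t pos)
{
    if (pos == len)
        return 1;                    /* end of the subject */
    /* or just before a newline that is the last character */
    return pos + 1 == len && subject[pos] == '\n';
}
```

For the four-character subject "abc\\n", at_Z is true at positions 3 and 4, while at_z is true only at position 4.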
615    
616  .SH CIRCUMFLEX AND DOLLAR  .SH CIRCUMFLEX AND DOLLAR
617  Outside a character class, the circumflex character is an assertion which is  Outside a character class, in the default matching mode, the circumflex
618  true only if the current matching point is at the start of the subject string,  character is an assertion which is true only if the current matching point is
619  in the default matching mode. Inside a character class, circumflex has an  at the start of the subject string. Inside a character class, circumflex has an
620  entirely different meaning (see below).  entirely different meaning (see below).
621    
622  Circumflex need not be the first character of the pattern if a number of  Circumflex need not be the first character of the pattern if a number of
# Line 637  Dollar has no special meaning in a chara Line 636  Dollar has no special meaning in a chara
636    
637  The meaning of dollar can be changed so that it matches only at the very end of  The meaning of dollar can be changed so that it matches only at the very end of
638  the string, by setting the PCRE_DOLLAR_ENDONLY option at compile or matching  the string, by setting the PCRE_DOLLAR_ENDONLY option at compile or matching
639  time.  time. This does not affect the \\Z assertion.
640    
641  The meanings of the circumflex and dollar characters are changed if the  The meanings of the circumflex and dollar characters are changed if the
642  PCRE_MULTILINE option is set at compile or matching time. When this is the  PCRE_MULTILINE option is set. When this is the case, they match immediately
643  case, they match immediately after and immediately before an internal "\\n"  after and immediately before an internal "\\n" character, respectively, in
644  character, respectively, in addition to matching at the start and end of the  addition to matching at the start and end of the subject string. For example,
645  subject string. For example, the pattern /^abc$/ matches the subject string  the pattern /^abc$/ matches the subject string "def\\nabc" in multiline mode,
646  "def\\nabc" in multiline mode, but not otherwise. Consequently, patterns that  but not otherwise. Consequently, patterns that are anchored in single line mode
647  are anchored in single line mode because all branches start with "^" are not  because all branches start with "^" are not anchored in multiline mode. The
648  anchored in multiline mode. The PCRE_DOLLAR_ENDONLY option is ignored if  PCRE_DOLLAR_ENDONLY option is ignored if PCRE_MULTILINE is set.
649  PCRE_MULTILINE is set.  
650    Note that the sequences \\A, \\Z, and \\z can be used to match the start and
651  Note that the sequences "\\A" and "\\Z" can be used to match the start and end  end of the subject in both modes, and if all branches of a pattern start with
652  of the subject in both modes, and if all branches of a pattern start with "\\A"  \\A it is always anchored, whether PCRE_MULTILINE is set or not.
 it is always anchored.  
653    
654    
655  .SH FULL STOP (PERIOD, DOT)  .SH FULL STOP (PERIOD, DOT)
# Line 668  An opening square bracket introduces a c Line 666  An opening square bracket introduces a c
666  square bracket. A closing square bracket on its own is not special. If a  square bracket. A closing square bracket on its own is not special. If a
667  closing square bracket is required as a member of the class, it should be the  closing square bracket is required as a member of the class, it should be the
668  first data character in the class (after an initial circumflex, if present) or  first data character in the class (after an initial circumflex, if present) or
669  escaped with \\.  escaped with a backslash.
670    
671  A character class matches a single character in the subject; the character must  A character class matches a single character in the subject; the character must
672  be in the set of characters defined by the class, unless the first character in  be in the set of characters defined by the class, unless the first character in
673  the class is a circumflex, in which case the subject character must not be in  the class is a circumflex, in which case the subject character must not be in
674  the set defined by the class. If a circumflex is actually required as a member  the set defined by the class. If a circumflex is actually required as a member
675  of the class, ensure it is not the first character, or escape it with \\.  of the class, ensure it is not the first character, or escape it with a
676    backslash.
677    
678  For example, the character class [aeiou] matches any lower case vowel, while  For example, the character class [aeiou] matches any lower case vowel, while
679  [^aeiou] matches any character that is not a lower case vowel. Note that a  [^aeiou] matches any character that is not a lower case vowel. Note that a
# Line 683  are in the class by enumerating those th Line 682  are in the class by enumerating those th
682  still consumes a character from the subject string, and fails if the current  still consumes a character from the subject string, and fails if the current
683  pointer is at the end of the string.  pointer is at the end of the string.
684    
685    When PCRE_CASELESS is set, any letters in a class represent both their upper
686    case and lower case versions, so for example, a caseless [aeiou] matches "A" as
687    well as "a", and a caseless [^aeiou] does not match "A", whereas a caseful
688    version would.
689    
690  The newline character is never treated in any special way in character classes,  The newline character is never treated in any special way in character classes,
691  whatever the setting of the PCRE_DOTALL or PCRE_MULTILINE options is. A class  whatever the setting of the PCRE_DOTALL or PCRE_MULTILINE options is. A class
692  such as [^a] will always match a newline.  such as [^a] will always match a newline.
# Line 690  such as [^a] will always match a newline Line 694  such as [^a] will always match a newline
694  The minus (hyphen) character can be used to specify a range of characters in a  The minus (hyphen) character can be used to specify a range of characters in a
695  character class. For example, [d-m] matches any letter between d and m,  character class. For example, [d-m] matches any letter between d and m,
696  inclusive. If a minus character is required in a class, it must be escaped with  inclusive. If a minus character is required in a class, it must be escaped with
697  \\ or appear in a position where it cannot be interpreted as indicating a  a backslash or appear in a position where it cannot be interpreted as
698  range, typically as the first or last character in the class. It is not  indicating a range, typically as the first or last character in the class. It
699  possible to have the character "]" as the end character of a range, since a  is not possible to have the character "]" as the end character of a range,
700  sequence such as [w-] is interpreted as a class of two characters. The octal or  since a sequence such as [w-] is interpreted as a class of two characters. The
701  hexadecimal representation of "]" can, however, be used to end a range.  octal or hexadecimal representation of "]" can, however, be used to end a
702    range.
703    
704  Ranges operate in ASCII collating sequence. They can also be used for  Ranges operate in ASCII collating sequence. They can also be used for
705  characters specified numerically, for example [\\000-\\037]. If a range such as  characters specified numerically, for example [\\000-\\037]. If a range such as
706  [W-c] is used when PCRE_CASELESS is set, it matches the letters involved in  [W-c] is used when PCRE_CASELESS is set, it matches the letters involved in
707  either case.  either case, so is equivalent to [][\\^_`wxyzabc], matched caselessly.
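The effect of a caseless range can be sketched in C. This is only an illustration assuming ASCII; the name in_range is hypothetical, and the code does not reflect how PCRE actually builds its character classes.

```c
#include <ctype.h>

/* Hypothetical helper, not part of PCRE: non-zero if character c is matched
   by the range lo-hi in ASCII collating sequence. When caseless is set, a
   letter also matches if its other-case form lies in the range. */
static int in_range(unsigned char c, unsigned char lo, unsigned char hi,
                    int caseless)
{
    if (c >= lo && c <= hi)
        return 1;
    if (caseless && isalpha(c)) {
        unsigned char other = isupper(c) ? (unsigned char)tolower(c)
                                         : (unsigned char)toupper(c);
        return other >= lo && other <= hi;
    }
    return 0;
}
```

For the range W to c with caseless set, this accepts "w", "A" and "]" but rejects "d" and "V", which agrees with the equivalent class given above.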
708    
709  The character types \\d, \\D, \\s, \\S, \\w, and \\W may also appear in a  The character types \\d, \\D, \\s, \\S, \\w, and \\W may also appear in a
710  character class, and add the characters that they match to the class. For  character class, and add the characters that they match to the class. For
711  example, the class [^\\W_] matches any letter or digit.  example, [\\dABCDEF] matches any hexadecimal digit. A circumflex can
712    conveniently be used with the upper case character types to specify a more
713    restricted set of characters than the matching lower case type. For example,
714    the class [^\\W_] matches any letter or digit, but not underscore.
715    
716  All non-alphameric characters other than \\, -, ^ (at the start) and the  All non-alphameric characters other than \\, -, ^ (at the start) and the
717  terminating ] are non-special in character classes, but it does no harm if they  terminating ] are non-special in character classes, but it does no harm if they
# Line 711  are escaped. Line 719  are escaped.
719    
720    
721  .SH VERTICAL BAR  .SH VERTICAL BAR
722  Vertical bar characters are used to separate alternative patterns. The matching  Vertical bar characters are used to separate alternative patterns. For example,
723  process tries all the alternatives in turn. For example, the pattern  the pattern
724    
725    gilbert|sullivan    gilbert|sullivan
726    
727  matches either "gilbert" or "sullivan". Any number of alternatives can be used,  matches either "gilbert" or "sullivan". Any number of alternatives may appear,
728  and an empty alternative is permitted (matching the empty string).  and an empty alternative is permitted (matching the empty string).
729    The matching process tries each alternative in turn, from left to right,
730    and the first one that succeeds is used. If the alternatives are within a
731    subpattern (defined below), "succeeds" means matching the rest of the main
732    pattern as well as the alternative in the subpattern.
733    
734    
735    .SH INTERNAL OPTION SETTING
736    The settings of PCRE_CASELESS, PCRE_MULTILINE, PCRE_DOTALL, and PCRE_EXTENDED
737    can be changed from within the pattern by a sequence of Perl option letters
738    enclosed between "(?" and ")". The option letters are
739    
740      i  for PCRE_CASELESS
741      m  for PCRE_MULTILINE
742      s  for PCRE_DOTALL
743      x  for PCRE_EXTENDED
744    
745    For example, (?im) sets caseless, multiline matching. It is also possible to
746    unset these options by preceding the letter with a hyphen, and a combined
747    setting and unsetting such as (?im-sx), which sets PCRE_CASELESS and
748    PCRE_MULTILINE while unsetting PCRE_DOTALL and PCRE_EXTENDED, is also
749    permitted. If a letter appears both before and after the hyphen, the option is
750    unset.
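The handling of option letters can be sketched in C. The bit values and the name apply_option_letters are hypothetical (the real PCRE_* constants are defined in pcre.h), and only the four Perl-compatible letters are handled.

```c
/* Illustrative bit values only; these are not the PCRE_* constants. */
enum { OPT_CASELESS = 1, OPT_MULTILINE = 2, OPT_DOTALL = 4, OPT_EXTENDED = 8 };

/* Hypothetical helper, not part of PCRE. letters is the text between "(?"
   and ")", for example "im-sx". Applies the changes to *options and
   returns 0, or returns -1 (leaving *options alone) on an unknown letter. */
static int apply_option_letters(const char *letters, int *options)
{
    int set = 0, unset = 0;
    int *target = &set;              /* letters before the hyphen set options */
    for (; *letters != '\0'; letters++) {
        int bit;
        switch (*letters) {
        case '-': target = &unset; continue;  /* later letters unset options */
        case 'i': bit = OPT_CASELESS;  break;
        case 'm': bit = OPT_MULTILINE; break;
        case 's': bit = OPT_DOTALL;    break;
        case 'x': bit = OPT_EXTENDED;  break;
        default:  return -1;
        }
        *target |= bit;
    }
    /* Unsetting is applied last, so a letter that appears on both sides of
       the hyphen ends up unset. */
    *options = (*options | set) & ~unset;
    return 0;
}
```

Applying "im-sx" to a value that has the dotall and extended bits set leaves only the caseless and multiline bits set.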
751    
752    The scope of these option changes depends on where in the pattern the setting
753    occurs. For settings that are outside any subpattern (defined below), the
754    effect is the same as if the options were set or unset at the start of
755    matching. The following patterns all behave in exactly the same way:
756    
757      (?i)abc
758      a(?i)bc
759      ab(?i)c
760      abc(?i)
761    
762    which in turn is the same as compiling the pattern abc with PCRE_CASELESS set.
763    In other words, such "top level" settings apply to the whole pattern (unless
764    there are other changes inside subpatterns). If there is more than one setting
765    of the same option at top level, the rightmost setting is used.
766    
767    If an option change occurs inside a subpattern, the effect is different. This
768    is a change of behaviour in Perl 5.005. An option change inside a subpattern
769    affects only that part of the subpattern that follows it, so
770    
771      (a(?i)b)c
772    
773    matches abc and aBc and no other strings (assuming PCRE_CASELESS is not used).
774    By this means, options can be made to have different settings in different
775    parts of the pattern. Any changes made in one alternative do carry on
776    into subsequent branches within the same subpattern. For example,
777    
778      (a(?i)b|c)
779    
780    matches "ab", "aB", "c", and "C", even though when matching "C" the first
781    branch is abandoned before the option setting. This is because the effects of
782    option settings happen at compile time. There would be some very weird
783    behaviour otherwise.
784    
785    The PCRE-specific options PCRE_UNGREEDY and PCRE_EXTRA can be changed in the
786    same way as the Perl-compatible options by using the characters U and X
787    respectively. The (?X) flag setting is special in that it must always occur
788    earlier in the pattern than any of the additional features it turns on, even
789    when it is at top level. It is best put at the start.
790    
791    
792  .SH SUBPATTERNS  .SH SUBPATTERNS
# Line 757  the captured substrings are "white queen Line 826  the captured substrings are "white queen
826  2. The maximum number of captured substrings is 99, and the maximum number of  2. The maximum number of captured substrings is 99, and the maximum number of
827  all subpatterns, both capturing and non-capturing, is 200.  all subpatterns, both capturing and non-capturing, is 200.
828    
829    As a convenient shorthand, if any option settings are required at the start of
830  .SH BACK REFERENCES  a non-capturing subpattern, the option letters may appear between the "?" and
831  Outside a character class, a backslash followed by a digit greater than 0 (and  the ":". Thus the two patterns
832  possibly further digits) is a back reference to a capturing subpattern earlier  
833  (i.e. to its left) in the pattern, provided there have been that many previous    (?i:saturday|sunday)
834  capturing left parentheses. However, if the decimal number following the    (?:(?i)saturday|sunday)
835  backslash is less than 10, it is always taken as a back reference, and causes  
836  an error if there have not been that many previous capturing left parentheses.  match exactly the same set of strings. Because alternative branches are tried
837  See the section entitled "Backslash" above for further details of the handling  from left to right, and options are not reset until the end of the subpattern
838  of digits following a backslash.  is reached, an option setting in one branch does affect subsequent branches, so
839    the above patterns match "SUNDAY" as well as "Saturday".
 A back reference matches whatever actually matched the capturing subpattern in  
 the current subject string, rather than anything matching the subpattern  
 itself. So the pattern  
   
    (sens|respons)e and \\1ibility  
   
 matches "sense and sensibility" and "response and responsibility", but not  
 "sense and responsibility".  
   
 There may be more than one back reference to the same subpattern. If a  
 subpattern has not actually been used in a particular match, then any back  
 references to it always fail. For example, the pattern  
   
   (a|(bc))\\2  
   
 always fails if it starts to match "a" rather than "bc". Because there may be  
 up to 99 back references, all digits following the backslash are taken  
 as part of a potential back reference number. If the pattern continues with a  
 digit character, then some delimiter must be used to terminate the back  
 reference. If the PCRE_EXTENDED option is set, this can be whitespace.  
 Otherwise an empty comment can be used.  
840    
841    
842  .SH REPETITION  .SH REPETITION
# Line 798  items: Line 846  items:
846    a single character, possibly escaped    a single character, possibly escaped
847    the . metacharacter    the . metacharacter
848    a character class    a character class
849    a back reference    a back reference (see next section)
850    a parenthesized subpattern    a parenthesized subpattern (unless it is an assertion - see below)
851    
852  The general repetition quantifier specifies a minimum and maximum number of  The general repetition quantifier specifies a minimum and maximum number of
853  permitted matches, by giving the two numbers in curly brackets (braces),  permitted matches, by giving the two numbers in curly brackets (braces),
# Line 821  matches at least 3 successive vowels, bu Line 869  matches at least 3 successive vowels, bu
869    
870  matches exactly 8 digits. An opening curly bracket that appears in a position  matches exactly 8 digits. An opening curly bracket that appears in a position
871  where a quantifier is not allowed, or one that does not match the syntax of a  where a quantifier is not allowed, or one that does not match the syntax of a
872  quantifier, is taken as a literal character. For example, "{,6}" is not a  quantifier, is taken as a literal character. For example, {,6} is not a
873  quantifier, but a literal string of four characters.  quantifier, but a literal string of four characters.
874    
875  The quantifier {0} is permitted, causing the expression to behave as if the  The quantifier {0} is permitted, causing the expression to behave as if the
# Line 834  quantifiers have single-character abbrev Line 882  quantifiers have single-character abbrev
882    +    is equivalent to {1,}    +    is equivalent to {1,}
883    ?    is equivalent to {0,1}    ?    is equivalent to {0,1}
884    
885    It is possible to construct infinite loops by following a subpattern that can
886    match no characters with a quantifier that has no upper limit, for example:
887    
888      (a?)*
889    
890    Earlier versions of Perl and PCRE used to give an error at compile time for
891    such patterns. However, because there are cases where this can be useful, such
892    patterns are now accepted, but if any repetition of the subpattern does in fact
893    match no characters, the loop is forcibly broken.
894    
895  By default, the quantifiers are "greedy", that is, they match as much as  By default, the quantifiers are "greedy", that is, they match as much as
896  possible (up to the maximum number of permitted times), without causing the  possible (up to the maximum number of permitted times), without causing the
897  rest of the pattern to fail. The classic example of where this gives problems  rest of the pattern to fail. The classic example of where this gives problems
# Line 861  quantifiers is not otherwise changed, ju Line 919  quantifiers is not otherwise changed, ju
919  Do not confuse this use of question mark with its use as a quantifier in its  Do not confuse this use of question mark with its use as a quantifier in its
920  own right. Because it has two uses, it can sometimes appear doubled, as in  own right. Because it has two uses, it can sometimes appear doubled, as in
921    
922     \\d??\\d    \\d??\\d
923    
924  which matches one digit by preference, but can match two if that is the only  which matches one digit by preference, but can match two if that is the only
925  way the rest of the pattern matches.  way the rest of the pattern matches.
926    
927  When a parenthesized subpattern is quantified a with minimum repeat count that  If the PCRE_UNGREEDY option is set (an option which is not available in Perl)
928    then the quantifiers are not greedy by default, but individual ones can be made
929    greedy by following them with a question mark. In other words, it inverts the
930    default behaviour.
931    
932    When a parenthesized subpattern is quantified with a minimum repeat count that
933  is greater than 1 or with a limited maximum, more store is required for the  is greater than 1 or with a limited maximum, more store is required for the
934  compiled pattern, in proportion to the size of the minimum or maximum.  compiled pattern, in proportion to the size of the minimum or maximum.
935    
# Line 875  follows will be tried against every char Line 938  follows will be tried against every char
938  PCRE treats this as though it were preceded by \\A.  PCRE treats this as though it were preceded by \\A.
939    
940  When a capturing subpattern is repeated, the value captured is the substring  When a capturing subpattern is repeated, the value captured is the substring
941  that matched the final iteration. For example,  that matched the final iteration. For example, after
942    
943      (tweedle[dume]{3}\\s*)+
944    
945    has matched "tweedledum tweedledee" the value of the captured substring is
946    "tweedledee". However, if there are nested capturing subpatterns, the
947    corresponding captured values may have been set in previous iterations. For
948    example, after
949    
950      /(a|(b))+/
951    
952    matches "aba" the value of the second captured substring is "b".
953    
954    
955    .SH BACK REFERENCES
956    Outside a character class, a backslash followed by a digit greater than 0 (and
957    possibly further digits) is a back reference to a capturing subpattern earlier
958    (i.e. to its left) in the pattern, provided there have been that many previous
959    capturing left parentheses.
960    
961    However, if the decimal number following the backslash is less than 10, it is
962    always taken as a back reference, and causes an error only if there are not
963    that many capturing left parentheses in the entire pattern. In other words, the
964    parentheses that are referenced need not be to the left of the reference for
965    numbers less than 10. See the section entitled "Backslash" above for further
966    details of the handling of digits following a backslash.
967    
968    A back reference matches whatever actually matched the capturing subpattern in
969    the current subject string, rather than anything matching the subpattern
970    itself. So the pattern
971    
972      (sens|respons)e and \\1ibility
973    
974    matches "sense and sensibility" and "response and responsibility", but not
975    "sense and responsibility". If caseful matching is in force at the time of the
976    back reference, then the case of letters is relevant. For example,
977    
978      ((?i)rah)\\s+\\1
979    
980    matches "rah rah" and "RAH RAH", but not "RAH rah", even though the original
981    capturing subpattern is matched caselessly.
982    
983    There may be more than one back reference to the same subpattern. If a
984    subpattern has not actually been used in a particular match, then any back
985    references to it always fail. For example, the pattern
986    
987      (a|(bc))\\2
988    
989     (\s*tweedle[dume]{3})+\\1  always fails if it starts to match "a" rather than "bc". Because there may be
990    up to 99 back references, all digits following the backslash are taken
991    as part of a potential back reference number. If the pattern continues with a
992    digit character, then some delimiter must be used to terminate the back
993    reference. If the PCRE_EXTENDED option is set, this can be whitespace.
994    Otherwise an empty comment can be used.
995    
996  matches "tweedledum tweedledee tweedledee" but not "tweedledum tweedledee  A back reference that occurs inside the parentheses to which it refers fails
997  tweedledum".  when the subpattern is first used, so, for example, (a\\1) never matches.
998    However, such references can be useful inside repeated subpatterns. For
999    example, the pattern
1000    
1001      (a|b\\1)+
1002    
1003    matches any number of "a"s and also "aba", "ababbaa" etc. At each iteration of
1004    the subpattern, the back reference matches the character string corresponding
1005    to the previous iteration. In order for this to work, the pattern must be such
1006    that the first iteration does not need to match the back reference. This can be
1007    done using alternation, as in the example above, or by a quantifier with a
1008    minimum of zero.
1009    
1010    
1011  .SH ASSERTIONS  .SH ASSERTIONS
1012  An assertion is a test on the characters following the current matching point  An assertion is a test on the characters following or preceding the current
1013  that does not actually consume any of those characters. The simple assertions  matching point that does not actually consume any characters. The simple
1014  coded as \\b, \\B, \\A, \\Z, ^ and $ are described above. More complicated  assertions coded as \\b, \\B, \\A, \\Z, \\z, ^ and $ are described above. More
1015  assertions are coded as subpatterns starting with (?= for positive assertions,  complicated assertions are coded as subpatterns. There are two kinds: those
1016  and (?! for negative assertions. For example,  that look ahead of the current position in the subject string, and those that
1017    look behind it.
1018    
1019    An assertion subpattern is matched in the normal way, except that it does not
1020    cause the current matching position to be changed. Lookahead assertions start
1021    with (?= for positive assertions and (?! for negative assertions. For example,
1022    
1023    \\w+(?=;)    \\w+(?=;)
1024    
# Line 904  apparently similar pattern Line 1034  apparently similar pattern
1034    
1035  does not find an occurrence of "bar" that is preceded by something other than  does not find an occurrence of "bar" that is preceded by something other than
1036  "foo"; it finds any occurrence of "bar" whatsoever, because the assertion  "foo"; it finds any occurrence of "bar" whatsoever, because the assertion
1037  (?!foo) is always true when the next three characters are "bar".  (?!foo) is always true when the next three characters are "bar". A
1038    lookbehind assertion is needed to achieve this effect.
1039    
1040    Lookbehind assertions start with (?<= for positive assertions and (?<! for
1041    negative assertions. For example,
1042    
1043      (?<!foo)bar
1044    
1045    does find an occurrence of "bar" that is not preceded by "foo". The contents of
1046    a lookbehind assertion are restricted such that all the strings it matches must
1047    have a fixed length. However, if there are several alternatives, they do not
1048    all have to have the same fixed length. Thus
1049    
1050      (?<=bullock|donkey)
1051    
1052    is permitted, but
1053    
1054      (?<!dogs?|cats?)
1055    
1056    causes an error at compile time. Branches that match different length strings
1057    are permitted only at the top level of a lookbehind assertion. This is an
1058    extension compared with Perl 5.005, which requires all branches to match the
1059    same length of string. An assertion such as
1060    
1061      (?<=ab(c|de))
1062    
1063    is not permitted, because its single branch can match two different lengths,
1064    but it is acceptable if rewritten to use two branches:
1065    
1066      (?<=abc|abde)
1067    
1068    The implementation of lookbehind assertions is, for each alternative, to
1069    temporarily move the current position back by the fixed width and then try to
1070    match. If there are insufficient characters before the current position, the
1071    match is deemed to fail.
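The strategy just described can be sketched in C for the simple case where every top-level alternative is a literal string, as in (?<=bullock|donkey). The name lookbehind_matches is hypothetical, and this is not the PCRE implementation.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of PCRE: tests a positive lookbehind whose
   top-level alternatives are literal strings. Returns non-zero if some
   alternative ends exactly at position pos in subject. */
static int lookbehind_matches(const char *subject, size_t pos,
                              const char *const *alts, size_t nalts)
{
    for (size_t i = 0; i < nalts; i++) {
        size_t width = strlen(alts[i]);  /* fixed width of this branch */
        if (width > pos)
            continue;                    /* too few characters before pos */
        if (memcmp(subject + pos - width, alts[i], width) == 0)
            return 1;                    /* matched after moving back */
    }
    return 0;
}
```

A negative lookbehind over the same alternatives would simply succeed where this function returns zero.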
1072    
1073    Assertions can be nested in any combination. For example,
1074    
1075      (?<=(?<!foo)bar)baz
1076    
1077    matches an occurrence of "baz" that is preceded by "bar" which in turn is not
1078    preceded by "foo".
1079    
1080  Assertion subpatterns are not capturing subpatterns, and may not be repeated,  Assertion subpatterns are not capturing subpatterns, and may not be repeated,
1081  because it makes no sense to assert the same thing several times. If an  because it makes no sense to assert the same thing several times. If an
# Line 917  Assertions count towards the maximum of Line 1088  Assertions count towards the maximum of
1088    
1089    
1090  .SH ONCE-ONLY SUBPATTERNS  .SH ONCE-ONLY SUBPATTERNS
 The facility described in this section is available only when the PCRE_EXTRA  
 option is set at compile time. It is an extension to Perl regular expressions.  
   
1091  With both maximizing and minimizing repetition, failure of what follows  With both maximizing and minimizing repetition, failure of what follows
1092  normally causes the repeated item to be re-evaluated to see if a different  normally causes the repeated item to be re-evaluated to see if a different
1093  number of repeats allows the rest of the pattern to match. Sometimes it is  number of repeats allows the rest of the pattern to match. Sometimes it is
1094  useful to prevent this, either to change the nature of the match, or to cause  useful to prevent this, either to change the nature of the match, or to cause
1095  it to fail earlier than it otherwise might when the author or the pattern knows  it to fail earlier than it otherwise might, when the author of the pattern knows
1096  there is no point in carrying on.  there is no point in carrying on.
1097    
1098  Consider, for example, the pattern \\d+foo when applied to the subject line  Consider, for example, the pattern \\d+foo when applied to the subject line
1099    
1100     123456bar    123456bar
1101    
1102  After matching all 6 digits and then failing to match "foo", the normal  After matching all 6 digits and then failing to match "foo", the normal
1103  action of the matcher is to try again with only 5 digits matching the \\d+  action of the matcher is to try again with only 5 digits matching the \\d+
# Line 939  has matched, it is not to be re-evaluate Line 1107  has matched, it is not to be re-evaluate
1107  give up immediately on failing to match "foo" the first time. The notation is  give up immediately on failing to match "foo" the first time. The notation is
1108  another kind of special parenthesis, starting with (?> as in this example:  another kind of special parenthesis, starting with (?> as in this example:
1109    
1110    (?>\d+)foo    (?>\\d+)foo
1111    
1112  This kind of parenthesis "locks up" the part of the pattern it contains once  This kind of parenthesis "locks up" the part of the pattern it contains once
1113  it has matched, and a failure further into the pattern is prevented from  it has matched, and a failure further into the pattern is prevented from
1114  backtracking into it. Backtracking past it to previous items, however, works as  backtracking into it. Backtracking past it to previous items, however, works as
1115  normal.  normal.
1116    
1117  For simple cases such as the above example, this feature can be though of as a  An alternative description is that a subpattern of this type matches the string
1118  maximizing repeat that must swallow everything it can. So, while both \\d+ and  of characters that an identical standalone pattern would match, if anchored at
1119  \\d+? are prepared to adjust the number of digits they match in order to make  the current point in the subject string.
1120  the rest of the pattern match, (?>\\d+) can only match an entire sequence of  
1121  digits.  Once-only subpatterns are not capturing subpatterns. Simple cases such as the
1122    above example can be thought of as a maximizing repeat that must swallow
1123    everything it can. So, while both \\d+ and \\d+? are prepared to adjust the
1124    number of digits they match in order to make the rest of the pattern match,
1125    (?>\\d+) can only match an entire sequence of digits.
1126    
1127  This construction can of course contain arbitrarily complicated subpatterns,  This construction can of course contain arbitrarily complicated subpatterns,
1128  and it can be nested. Contrast with the \\X assertion, which is a Prolog-like  and it can be nested.
1129  "cut".  
1130    
1131    .SH CONDITIONAL SUBPATTERNS
1132    It is possible to cause the matching process to obey a subpattern
1133    conditionally or to choose between two alternative subpatterns, depending on
1134    the result of an assertion, or whether a previous capturing subpattern matched
1135    or not. The two possible forms of conditional subpattern are
1136    
1137      (?(condition)yes-pattern)
1138      (?(condition)yes-pattern|no-pattern)
1139    
1140    If the condition is satisfied, the yes-pattern is used; otherwise the
1141    no-pattern (if present) is used. If there are more than two alternatives in the
1142    subpattern, a compile-time error occurs.
1143    
1144    There are two kinds of condition. If the text between the parentheses consists
1145    of a sequence of digits, then the condition is satisfied if the capturing
1146    subpattern of that number has previously matched. Consider the following
1147    pattern, which contains non-significant white space to make it more readable
1148    (assume the PCRE_EXTENDED option) and to divide it into three parts for ease
1149    of discussion:
1150    
1151      ( \\( )?    [^()]+    (?(1) \\) )
1152    
1153    The first part matches an optional opening parenthesis, and if that
1154    character is present, sets it as the first captured substring. The second part
1155    matches one or more characters that are not parentheses. The third part is a
1156    conditional subpattern that tests whether the first set of parentheses matched
1157    or not. If they did, that is, if the subject started with an opening parenthesis,
1158    the condition is true, and so the yes-pattern is executed and a closing
1159    parenthesis is required. Otherwise, since no-pattern is not present, the
1160    subpattern matches nothing. In other words, this pattern matches a sequence of
1161    non-parentheses, optionally enclosed in parentheses.
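
The behaviour of this pattern can be tried out with the short sketch below. It assumes Python's re module as a stand-in for PCRE (Python supports the same (?(1)...) notation, and its re.VERBOSE flag plays the role of PCRE_EXTENDED here). The helper name fully_matches is hypothetical.

```python
import re

# The three-part pattern from the text, with non-significant white
# space. re.VERBOSE is used in place of PCRE_EXTENDED (assumption).
COND = re.compile(r"( \( )?   [^()]+   (?(1) \) )", re.VERBOSE)


def fully_matches(subject):
    """True if the whole subject is matched by the pattern."""
    return COND.fullmatch(subject) is not None


for subject in ["abc", "(abc)", "(abc", "abc)"]:
    print(repr(subject), fully_matches(subject))
```

The whole-subject match is used so that an unbalanced parenthesis shows up as a failure instead of being skipped by an unanchored search.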
1162    
1163    If the condition is not a sequence of digits, it must be an assertion. This may
1164    be a positive or negative lookahead or lookbehind assertion. Consider this
1165    pattern, again containing non-significant white space, and with the two
1166    alternatives on the second line:
1167    
1168      (?(?=[^a-z]*[a-z])
1169      \\d{2}-[a-z]{3}-\\d{2}  |  \\d{2}-\\d{2}-\\d{2} )
1170    
1171    The condition is a positive lookahead assertion that matches an optional
1172    sequence of non-letters followed by a letter. In other words, it tests for the
1173    presence of at least one letter in the subject. If a letter is found, the
1174    subject is matched against the first alternative; otherwise it is matched
1175    against the second. This pattern matches strings in one of the two forms
1176    dd-aaa-dd or dd-dd-dd, where aaa are letters and dd are digits.
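
Python's re module, used here as a stand-in for PCRE, does not accept an assertion as a condition, so the sketch below emulates the construct instead of reproducing it: the yes-pattern is guarded by the positive lookahead and the no-pattern by its negation. This is an assumed equivalent for illustration only, written for the dd-aaa-dd and dd-dd-dd forms described above. The names COND, YES, NO and which_form are hypothetical.

```python
import re

COND = r"[^a-z]*[a-z]"          # "is there a letter ahead?"
YES = r"\d{2}-[a-z]{3}-\d{2}"   # form dd-aaa-dd
NO = r"\d{2}-\d{2}-\d{2}"       # form dd-dd-dd

# Emulation (assumption): (?(?=COND)YES|NO) is written as
# (?:(?=COND)YES|(?!COND)NO), so exactly one branch is eligible.
EMULATED = re.compile(rf"(?:(?={COND})({YES})|(?!{COND})({NO}))")


def which_form(subject):
    """Return "yes", "no", or None for a match at the subject start."""
    m = EMULATED.match(subject)
    if m is None:
        return None
    return "yes" if m.group(1) is not None else "no"


print(which_form("12-abc-34"))   # letter present, yes-pattern fits
print(which_form("12-34-56"))    # no letter, no-pattern fits
print(which_form("12-34-56 x"))  # letter later on selects yes-pattern
```

The last subject is the instructive one: the lookahead finds the trailing letter, so only the yes-pattern is tried, and it fails even though the text begins with something the no-pattern would accept.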
1177    
1178    
1179  .SH COMMENTS  .SH COMMENTS
# Line 967  character class introduces a comment tha Line 1186  character class introduces a comment tha
1186  character in the pattern.  character in the pattern.
1187    
1188    
 .SH INTERNAL FLAG SETTING  
 If the sequence (?i) occurs anywhere in a pattern, it has the effect of setting  
 the PCRE_CASELESS option, that is, all letters are matched in a  
 case-independent manner. The option applies to the whole pattern, not just to  
 the portion that follows it.  
   
 If the sequence (?m) occurs anywhere in a pattern, it has the effect of setting  
 the PCRE_MULTILINE option, that is, subject strings matched by this pattern are  
 treated as consisting of multiple lines.  
   
 If the sequence (?s) occurs anywhere in a pattern, it has the effect of setting  
 the PCRE_DOTALL option, so that dot metacharacters match newlines as well as  
 all other characters.  
   
 If the sequence (?x) occurs anywhere in a pattern, it has the effect of setting  
 the PCRE_EXTENDED option, that is, whitespace is ignored and # introduces a  
 comment that lasts till the next newline. The option applies to the whole  
 pattern, not just to the portion that follows it.  
   
 If more than one option is required, they can be specified jointly, for example  
 as (?ix) or (?mi).  
   
   
1189  .SH PERFORMANCE  .SH PERFORMANCE
1190  Certain items that may appear in patterns are more efficient than others. It is  Certain items that may appear in patterns are more efficient than others. It is
1191  more efficient to use a character class like [aeiou] than a set of alternatives  more efficient to use a character class like [aeiou] than a set of alternatives
# Line 998  required behaviour is usually the most e Line 1194  required behaviour is usually the most e
1194  contains a lot of discussion about optimizing regular expressions for efficient  contains a lot of discussion about optimizing regular expressions for efficient
1195  performance.  performance.
1196    
 The use of PCRE_MULTILINE causes additional processing and should be avoided  
 when it is not necessary. Caseless matching of character classes is more  
 efficient if PCRE_CASELESS is set when the pattern is compiled.  
   
1197    
1198  .SH AUTHOR  .SH AUTHOR
1199  Philip Hazel <ph10@cam.ac.uk>  Philip Hazel <ph10@cam.ac.uk>
# Line 1014  Cambridge CB2 3QG, England. Line 1206  Cambridge CB2 3QG, England.
1206  .br  .br
1207  Phone: +44 1223 334714  Phone: +44 1223 334714
1208    
1209  Copyright (c) 1997 University of Cambridge.  Copyright (c) 1998 University of Cambridge.
