Diff of /code/trunk/pcre.3

revision 32 by nigel, Sat Feb 24 21:38:53 2007 UTC revision 33 by nigel, Sat Feb 24 21:39:01 2007 UTC
# Line 264  negative numbers: Line 264  negative numbers:
264    PCRE_ERROR_BADMAGIC   the "magic number" was not found    PCRE_ERROR_BADMAGIC   the "magic number" was not found
265    
266  If the \fIoptptr\fR argument is not NULL, a copy of the options with which the  If the \fIoptptr\fR argument is not NULL, a copy of the options with which the
267  pattern was compiled is placed in the integer it points to.  pattern was compiled is placed in the integer it points to. These option bits
268    are those specified in the call to \fBpcre_compile()\fR, modified by any
269    top-level option settings within the pattern itself, and with the PCRE_ANCHORED
270    bit set if the form of the pattern implies that it can match only at the start
271    of a subject string.
272    
273    If the pattern is not anchored and the \fIfirstcharptr\fR argument is not NULL,
274    it is used to pass back information about the first character of any matched
275    string. If there is a fixed first character, e.g. from a pattern such as
276    (cat|cow|coyote), then it is returned in the integer pointed to by
277    \fIfirstcharptr\fR. Otherwise, if either
278    
279  If the \fIfirstcharptr\fR argument is not NULL, is is used to pass back    (a) the pattern was compiled with the PCRE_MULTILINE option, and every branch
280  information about the first character of any matched string. If there is a        starts with "^", or
281  fixed first character, e.g. from a pattern such as (cat|cow|coyote), then it is  
282  returned in the integer pointed to by \fIfirstcharptr\fR. Otherwise, if the    (b) every branch of the pattern starts with ".*" and PCRE_DOTALL is not set
283  pattern was compiled with the PCRE_MULTILINE option, and every branch started        (if it were set, the pattern would be anchored),
284  with "^", then -1 is returned, indicating that the pattern will match at the  
285    then -1 is returned, indicating that the pattern matches only at the
286  start of a subject string or after any "\\n" within the string. Otherwise -2 is  start of a subject string or after any "\\n" within the string. Otherwise -2 is
287  returned.  returned.
288    
# Line 1050  When a parenthesized subpattern is quant Line 1061  When a parenthesized subpattern is quant
1061  is greater than 1 or with a limited maximum, more store is required for the  is greater than 1 or with a limited maximum, more store is required for the
1062  compiled pattern, in proportion to the size of the minimum or maximum.  compiled pattern, in proportion to the size of the minimum or maximum.
1063    
1064  If a pattern starts with .* then it is implicitly anchored, since whatever  If a pattern starts with .* or .{0,} and the PCRE_DOTALL option (equivalent
1065  follows will be tried against every character position in the subject string.  to Perl's /s) is set, thus allowing the . to match newlines, then the pattern
1066  PCRE treats this as though it were preceded by \\A.  is implicitly anchored, because whatever follows will be tried against every
1067    character position in the subject string, so there is no point in retrying the
1068    overall match at any position after the first. PCRE treats such a pattern as
1069    though it were preceded by \\A. In cases where it is known that the subject
1070    string contains no newlines, it is worth setting PCRE_DOTALL when the pattern
1071    begins with .* in order to obtain this optimization, or alternatively using ^
1072    to indicate anchoring explicitly.
1073    
1074  When a capturing subpattern is repeated, the value captured is the substring  When a capturing subpattern is repeated, the value captured is the substring
1075  that matched the final iteration. For example, after  that matched the final iteration. For example, after
# Line 1262  proceeds from left to right, PCRE will l Line 1279  proceeds from left to right, PCRE will l
1279  then see if what follows matches the rest of the pattern. If the pattern is  then see if what follows matches the rest of the pattern. If the pattern is
1280  specified as  specified as
1281    
1282    .*abcd$    ^.*abcd$
1283    
1284  then the initial .* matches the entire string at first, but when this fails, it  then the initial .* matches the entire string at first, but when this fails, it
1285  backtracks to match all but the last character, then all but the last two  backtracks to match all but the last character, then all but the last two
# Line 1270  characters, and so on. Once again the se Line 1287  characters, and so on. Once again the se
1287  from right to left, so we are no better off. However, if the pattern is written  from right to left, so we are no better off. However, if the pattern is written
1288  as  as
1289    
1290    (?>.*)(?<=abcd)    ^(?>.*)(?<=abcd)
1291    
1292  then there can be no backtracking for the .* item; it can match only the entire  then there can be no backtracking for the .* item; it can match only the entire
1293  string. The subsequent lookbehind assertion does a single test on the last four  string. The subsequent lookbehind assertion does a single test on the last four
# Line 1344  required behaviour is usually the most e Line 1361  required behaviour is usually the most e
1361  contains a lot of discussion about optimizing regular expressions for efficient  contains a lot of discussion about optimizing regular expressions for efficient
1362  performance.  performance.
1363    
1364    When a pattern begins with .* and the PCRE_DOTALL option is set, the pattern is
1365    implicitly anchored by PCRE, since it can match only at the start of a subject
1366    string. However, if PCRE_DOTALL is not set, PCRE cannot make this optimization,
1367    because the . metacharacter does not then match a newline, and if the subject
1368    string contains newlines, the pattern may match from the character immediately
1369    following one of them instead of from the very start. For example, the pattern
1370    
1371       (.*) second
1372    
1373    matches the subject "first\\nand second" (where \\n stands for a newline
1374    character) with the first captured substring being "and". In order to do this,
1375    PCRE has to retry the match starting after every newline in the subject.
1376    
1377    If you are using such a pattern with subject strings that do not contain
1378    newlines, the best performance is obtained by setting PCRE_DOTALL, or starting
1379    the pattern with ^.* to indicate explicit anchoring. That saves PCRE from
1380    having to scan along the subject looking for a newline to restart at.
1381    
1382  .SH AUTHOR  .SH AUTHOR
1383  Philip Hazel <ph10@cam.ac.uk>  Philip Hazel <ph10@cam.ac.uk>
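
The matching behaviour described in the text added around line 1364 can be reproduced with a short sketch. It uses Python's re module as a stand-in, on the assumption that its handling of "." and newlines agrees with PCRE for this pattern. It is not PCRE itself and it shows only which substring is captured, not the internal anchoring optimization. The helper name first_capture is hypothetical.

```python
import re

SUBJECT = "first\nand second"


def first_capture(pattern, subject, flags=0):
    """Return the first captured substring, or None if the pattern does not match."""
    match = re.search(pattern, subject, flags)
    return match.group(1) if match else None


# Without DOTALL, "." cannot match the newline, so the match has to start
# after it and the capture is the text following the newline.
unanchored = first_capture(r"(.*) second", SUBJECT)

# With DOTALL, "." matches the newline, so the match starts at the very
# beginning of the subject, as an anchored pattern would.
dotall = first_capture(r"(.*) second", SUBJECT, re.DOTALL)

# An explicit ^ (without MULTILINE) allows a match only at the start of the
# subject; here that fails because ".*" cannot cross the newline.
explicit = first_capture(r"^(.*) second", SUBJECT)

print(repr(unanchored), repr(dotall), repr(explicit))
```

With a subject that contains no newlines, all three forms capture the same text, which is why the added paragraph recommends setting PCRE_DOTALL or writing ^.* in that case.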