/[pcre]/code/trunk/doc/pcreperform.3
ViewVC logotype

Diff of /code/trunk/doc/pcreperform.3

Parent Directory Parent Directory | Revision Log Revision Log | View Patch Patch

revision 92 by nigel, Sat Feb 24 21:40:52 2007 UTC revision 93 by nigel, Sat Feb 24 21:41:42 2007 UTC
# Line 4  PCRE - Perl-compatible regular expressio Line 4  PCRE - Perl-compatible regular expressio
4  .SH "PCRE PERFORMANCE"  .SH "PCRE PERFORMANCE"
5  .rs  .rs
6  .sp  .sp
7  Certain items that may appear in regular expression patterns are more efficient  Two aspects of performance are discussed below: memory usage and processing
8    time. The way you express your pattern as a regular expression can affect both
9    of them.
10    .
11    .SH "MEMORY USAGE"
12    .rs
13    .sp
14    Patterns are compiled by PCRE into a reasonably efficient byte code, so that
15    most simple patterns do not use much memory. However, there is one case where
16    memory usage can be unexpectedly large. When a parenthesized subpattern has a
17    quantifier with a minimum greater than 1 and/or a limited maximum, the whole
18    subpattern is repeated in the compiled code. For example, the pattern
19    .sp
20      (abc|def){2,4}
21    .sp
22    is compiled as if it were
23    .sp
24      (abc|def)(abc|def)((abc|def)(abc|def)?)?
25    .sp
26    (Technical aside: It is done this way so that backtrack points within each of
27    the repetitions can be independently maintained.)
28    .P
29    For regular expressions whose quantifiers use only small numbers, this is not
30    usually a problem. However, if the numbers are large, and particularly if such
31    repetitions are nested, the memory usage can become an embarrassment. For
32    example, the very simple pattern
33    .sp
34      ((ab){1,1000}c){1,3}
35    .sp
36    uses 51K bytes when compiled. When PCRE is compiled with its default internal
37    pointer size of two bytes, the size limit on a compiled pattern is 64K, and
38    this is reached with the above pattern if the outer repetition is increased
39    from 3 to 4. PCRE can be compiled to use larger internal pointers and thus
40    handle larger compiled patterns, but it is better to try to rewrite your
41    pattern to use less memory if you can.
42    .P
43    One way of reducing the memory usage for such patterns is to make use of PCRE's
44    .\" HTML <a href="pcrepattern.html#subpatternsassubroutines">
45    .\" </a>
46    "subroutine"
47    .\"
48    facility. Re-writing the above pattern as
49    .sp
50      ((ab)(?2){0,999}c)(?1){0,2}
51    .sp
52    reduces the memory requirements to 18K, and indeed it remains under 20K even
53    with the outer repetition increased to 100. However, this pattern is not
54    exactly equivalent, because the "subroutine" calls are treated as
55    .\" HTML <a href="pcrepattern.html#atomicgroup">
56    .\" </a>
57    atomic groups
58    .\"
59    into which there can be no backtracking if there is a subsequent matching
60    failure. Therefore, PCRE cannot do this kind of rewriting automatically.
61    Furthermore, there is a noticeable loss of speed when executing the modified
62    pattern. Nevertheless, if the atomic grouping is not a problem and the loss of
63    speed is acceptable, this kind of rewriting will allow you to process patterns
64    that PCRE cannot otherwise handle.
65    .
66    .SH "PROCESSING TIME"
67    .rs
68    .sp
69    Certain items in regular expression patterns are processed more efficiently
70  than others. It is more efficient to use a character class like [aeiou] than a  than others. It is more efficient to use a character class like [aeiou] than a
71  set of alternatives such as (a|e|i|o|u). In general, the simplest construction  set of single-character alternatives such as (a|e|i|o|u). In general, the
72  that provides the required behaviour is usually the most efficient. Jeffrey  simplest construction that provides the required behaviour is usually the most
73  Friedl's book contains a lot of useful general discussion about optimizing  efficient. Jeffrey Friedl's book contains a lot of useful general discussion
74  regular expressions for efficient performance. This document contains a few  about optimizing regular expressions for efficient performance. This document
75  observations about PCRE.  contains a few observations about PCRE.
76  .P  .P
77  Using Unicode character properties (the \ep, \eP, and \eX escapes) is slow,  Using Unicode character properties (the \ep, \eP, and \eX escapes) is slow,
78  because PCRE has to scan a structure that contains data for over fifteen  because PCRE has to scan a structure that contains data for over fifteen
# Line 42  Beware of patterns that contain nested i Line 104  Beware of patterns that contain nested i
104  long time to run when applied to a string that does not match. Consider the  long time to run when applied to a string that does not match. Consider the
105  pattern fragment  pattern fragment
106  .sp  .sp
107    (a+)*    ^(a+)*
108  .sp  .sp
109  This can match "aaaa" in 33 different ways, and this number increases very  This can match "aaaa" in 16 different ways, and this number increases very
110  rapidly as the string gets longer. (The * repeat can match 0, 1, 2, 3, or 4  rapidly as the string gets longer. (The * repeat can match 0, 1, 2, 3, or 4
111  times, and for each of those cases other than 0, the + repeats can match  times, and for each of those cases other than 0 or 4, the + repeats can match
112  different numbers of times.) When the remainder of the pattern is such that the  different numbers of times.) When the remainder of the pattern is such that the
113  entire match is going to fail, PCRE has in principle to try every possible  entire match is going to fail, PCRE has in principle to try every possible
114  variation, and this can take an extremely long time.  variation, and this can take an extremely long time, even for relatively short
115    strings.
116  .P  .P
117  An optimization catches some of the more simple cases such as  An optimization catches some of the more simple cases such as
118  .sp  .sp
# Line 71  In many cases, the solution to this kind Line 134  In many cases, the solution to this kind
134  atomic group or a possessive quantifier.  atomic group or a possessive quantifier.
135  .P  .P
136  .in 0  .in 0
137  Last updated: 28 February 2005  Last updated: 20 September 2006
138  .br  .br
139  Copyright (c) 1997-2005 University of Cambridge.  Copyright (c) 1997-2006 University of Cambridge.

Legend:
Removed from v.92  
changed lines
  Added in v.93

webmaster@exim.org
ViewVC Help
Powered by ViewVC 1.1.12