.TH PCREPERFORM 3
.SH NAME
PCRE - Perl-compatible regular expressions
.SH "PCRE PERFORMANCE"
.rs
.sp
Two aspects of performance are discussed below: memory usage and processing
time. The way you express your pattern as a regular expression can affect both
of them.
.
.SH "COMPILED PATTERN MEMORY USAGE"
.rs
.sp
Patterns are compiled by PCRE into a reasonably efficient interpretive code, so
that most simple patterns do not use much memory. However, there is one case
where the memory usage of a compiled pattern can be unexpectedly large. If a
parenthesized subpattern has a quantifier with a minimum greater than 1 and/or
a limited maximum, the whole subpattern is repeated in the compiled code. For
example, the pattern
.sp
  (abc|def){2,4}
.sp
is compiled as if it were
.sp
  (abc|def)(abc|def)((abc|def)(abc|def)?)?
.sp
(Technical aside: It is done this way so that backtrack points within each of
the repetitions can be independently maintained.)
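.P
The expansion described above can be sketched in a few lines of Python. This
only reproduces the textual "as if" form shown in this section, not PCRE's
internal compiled code. The function name expand_bounded_repeat is made up for
the illustration, and it assumes the group is passed in already parenthesized.
.sp
```python
def expand_bounded_repeat(group, minimum, maximum):
    """Spell out group{minimum,maximum} as explicit copies of group.

    Hypothetical helper: `group` must already be parenthesized, for example
    "(abc|def)", and 0 <= minimum <= maximum is assumed.
    """
    # The mandatory repetitions are written out one after another.
    required = group * minimum
    # The optional repetitions nest, so each one is attempted only after
    # the one before it has matched.
    optional = ""
    for level in range(maximum - minimum):
        if level == 0:
            optional = group + "?"
        else:
            optional = "(" + group + optional + ")?"
    return required + optional


if __name__ == "__main__":
    print(expand_bounded_repeat("(abc|def)", 2, 4))
    # The spelled-out text grows with the maximum of the quantifier.
    print(len(expand_bounded_repeat("(ab)", 1, 1000)))
```
.sp
Every extra unit in the maximum adds one more full copy of the group, which is
why large maxima make the compiled form large.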
.P
For regular expressions whose quantifiers use only small numbers, this is not
usually a problem. However, if the numbers are large, and particularly if such
repetitions are nested, the memory usage can become an embarrassment. For
example, the very simple pattern
.sp
  ((ab){1,1000}c){1,3}
.sp
uses 51K bytes when compiled using the 8-bit library. When PCRE is compiled
with its default internal pointer size of two bytes, the size limit on a
compiled pattern is 64K data units, and this is reached with the above pattern
if the outer repetition is increased from 3 to 4. PCRE can be compiled to use
larger internal pointers and thus handle larger compiled patterns, but it is
better to try to rewrite your pattern to use less memory if you can.
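.P
A back-of-envelope calculation shows why nesting makes the size grow so
quickly. The sketch below multiplies the maxima of the nested repeats to count
copies of the innermost group. The per-copy byte figure is an assumption
derived only from the 51K quoted above (it ignores everything else in the
compiled pattern), and the helper names are hypothetical. Under that
assumption an outer maximum of 4 needs about 68K, which is over the 64K limit.
.sp
```python
from math import prod


def innermost_copies(maxima):
    """Copies of the innermost group, given the maximum of each nested
    bounded repeat (hypothetical helper, order does not matter)."""
    return prod(maxima)


def rough_size_in_bytes(copies, bytes_per_copy):
    """Rough size estimate; bytes_per_copy is an assumed figure."""
    return copies * bytes_per_copy


if __name__ == "__main__":
    # ((ab){1,1000}c){1,3}: 1000 inner copies inside each of 3 outer copies.
    with_outer_3 = innermost_copies([1000, 3])
    with_outer_4 = innermost_copies([1000, 4])
    # Assumption: the quoted 51K bytes is spread evenly over the inner copies.
    per_copy = 51 * 1024 / with_outer_3
    estimate_kib = rough_size_in_bytes(with_outer_4, per_copy) / 1024
    print(with_outer_3, with_outer_4, round(estimate_kib))
```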
.P
One way of reducing the memory usage for such patterns is to make use of PCRE's
.\" HTML <a href="pcrepattern.html#subpatternsassubroutines">
.\" </a>
"subroutine"
.\"
facility. Re-writing the above pattern as
.sp
  ((ab)(?2){0,999}c)(?1){0,2}
.sp
reduces the memory requirements to 18K, and indeed it remains under 20K even
with the outer repetition increased to 100. However, this pattern is not
exactly equivalent, because the "subroutine" calls are treated as
.\" HTML <a href="pcrepattern.html#atomicgroup">
.\" </a>
atomic groups
.\"
into which there can be no backtracking if there is a subsequent matching
failure. Therefore, PCRE cannot do this kind of rewriting automatically.
Furthermore, there is a noticeable loss of speed when executing the modified
pattern. Nevertheless, if the atomic grouping is not a problem and the loss of
speed is acceptable, this kind of rewriting will allow you to process patterns
that PCRE cannot otherwise handle.
.
.
.SH "STACK USAGE AT RUN TIME"
.rs
.sp
When \fBpcre_exec()\fP or \fBpcre16_exec()\fP is used for matching, certain
kinds of pattern can cause it to use large amounts of the process stack. In
some environments the default process stack is quite small, and if it runs out
the result is often SIGSEGV. This issue is probably the most frequently raised
problem with PCRE. Rewriting your pattern can often help. The
.\" HREF
\fBpcrestack\fP
.\"
documentation discusses this issue in detail.
.
.
.SH "PROCESSING TIME"
.rs
.sp
Certain items in regular expression patterns are processed more efficiently
than others. It is more efficient to use a character class like [aeiou] than a
set of single-character alternatives such as (a|e|i|o|u). In general, the
simplest construction that provides the required behaviour is usually the most
efficient. Jeffrey Friedl's book "Mastering Regular Expressions" contains a lot
of useful general discussion about optimizing regular expressions for efficient
performance. This document contains a few observations about PCRE.
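.P
One way to check a claim like this for your own patterns is to time both
forms on representative data. The harness below is a sketch that uses
Python's re module, which is a separate backtracking engine and not PCRE, so
its timings only illustrate the method. The helper names are made up for the
example. It first confirms that the two patterns find the same matches and
then times each one.
.sp
```python
import re
import timeit

CLASS_PATTERN = re.compile(r"[aeiou]")
ALTERNATION_PATTERN = re.compile(r"(a|e|i|o|u)")


def vowel_spans(pattern, subject):
    """Return the (start, end) span of every match of pattern in subject."""
    return [m.span() for m in pattern.finditer(subject)]


def time_pattern(pattern, subject, repeats=50):
    """Return the total seconds taken to scan subject `repeats` times."""
    return timeit.timeit(lambda: vowel_spans(pattern, subject), number=repeats)


if __name__ == "__main__":
    subject = "the quick brown fox jumps over the lazy dog " * 100
    # Both forms must agree before a timing comparison means anything.
    assert vowel_spans(CLASS_PATTERN, subject) == vowel_spans(ALTERNATION_PATTERN, subject)
    print("class:      ", time_pattern(CLASS_PATTERN, subject))
    print("alternation:", time_pattern(ALTERNATION_PATTERN, subject))
```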
.P
Using Unicode character properties (the \ep, \eP, and \eX escapes) is slow,
because PCRE has to scan a structure that contains data for over fifteen
thousand characters whenever it needs a character's property. If you can find
an alternative pattern that does not use character properties, it will probably
be faster.
.P
By default, the escape sequences \eb, \ed, \es, and \ew, and the POSIX
character classes such as [:alpha:] do not use Unicode properties, partly for
backwards compatibility, and partly for performance reasons. However, you can
set PCRE_UCP if you want Unicode character properties to be used. This can
double the matching time for items such as \ed, when matched with
a traditional matching function; the performance loss is less with
a DFA matching function, and in both cases there is not much difference for
\eb.
.P
When a pattern begins with .* not in parentheses, or in parentheses that are
not the subject of a backreference, and the PCRE_DOTALL option is set, the
pattern is implicitly anchored by PCRE, since it can match only at the start of
a subject string. However, if PCRE_DOTALL is not set, PCRE cannot make this
optimization, because the . metacharacter does not then match a newline, and if
the subject string contains newlines, the pattern may match from the character
immediately following one of them instead of from the very start. For example,
the pattern
.sp
  .*second
.sp
matches the subject "first\enand second" (where \en stands for a newline
character), with the match starting at the seventh character. In order to do
this, PCRE has to retry the match starting after every newline in the subject.
.P
If you are using such a pattern with subject strings that do not contain
newlines, the best performance is obtained by setting PCRE_DOTALL, or starting
the pattern with ^.* or ^.*? to indicate explicit anchoring. That saves PCRE
from having to scan along the subject looking for a newline to restart at.
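.P
The matching behaviour described above can be observed with any engine that
shares these semantics. The sketch below uses Python's re module, which is not
PCRE; its re.DOTALL flag is used here as the counterpart of PCRE_DOTALL, and
match_start is a helper name invented for the example. It shows where the
match starts, not how long it takes.
.sp
```python
import re


def match_start(pattern, subject, flags=0):
    """Return the 0-based offset where the first match starts, or None."""
    m = re.search(pattern, subject, flags)
    return None if m is None else m.start()


if __name__ == "__main__":
    with_newline = "first\nand second"
    without_newline = "first and second"

    # Without DOTALL the match begins after the newline: offset 6 is the
    # seventh character.
    print(match_start(r".*second", with_newline))
    # With DOTALL the dot crosses the newline, so the match begins at 0.
    print(match_start(r".*second", with_newline, re.DOTALL))
    # Explicit anchoring gives the same answer only when the subject has no
    # newline; with one, the anchored pattern finds nothing.
    print(match_start(r"^.*second", without_newline))
    print(match_start(r"^.*second", with_newline))
```
.sp
The last two calls show why the advice is limited to subjects without
newlines: anchoring changes the result when a newline is present.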
.P
Beware of patterns that contain nested indefinite repeats. These can take a
long time to run when applied to a string that does not match. Consider the
pattern fragment
.sp
  ^(a+)*
.sp
This can match "aaaa" in 16 different ways, and this number increases very
rapidly as the string gets longer. (The * repeat can match 0, 1, 2, 3, or 4
times, and for each of those cases other than 0 or 4, the + repeats can match
different numbers of times.) When the remainder of the pattern is such that the
entire match is going to fail, PCRE has in principle to try every possible
variation, and this can take an extremely long time, even for relatively short
strings.
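.P
The figure of 16 can be checked by counting. Each way of matching is a
sequence of non-empty runs of "a" (one run per iteration of the * repeat)
whose total length does not exceed the length of the subject. The sketch below
counts those sequences; the function name is made up for the illustration, and
it models the counting argument only, not PCRE's matching code.
.sp
```python
from functools import lru_cache


@lru_cache(maxsize=None)
def count_match_ways(remaining):
    """Count the ways ^(a+)* can match at the start of `remaining` a's."""
    # One way is to stop here and let the * repeat end.
    ways = 1
    # Otherwise the next iteration of a+ takes 1..remaining characters,
    # and the rest is counted recursively.
    for run_length in range(1, remaining + 1):
        ways += count_match_ways(remaining - run_length)
    return ways


if __name__ == "__main__":
    for length in (4, 10, 20, 30):
        print(length, count_match_ways(length))
```
.sp
The count doubles with every extra character, which is why a failing match
becomes slow so quickly.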
.P
An optimization catches some of the simpler cases such as
.sp
  (a+)*b
.sp
where a literal character follows. Before embarking on the standard matching
procedure, PCRE checks that there is a "b" later in the subject string, and if
there is not, it fails the match immediately. However, when there is no
following literal this optimization cannot be used. You can see the difference
by comparing the behaviour of
.sp
  (a+)*\ed
.sp
with the pattern above. The pattern ending in "b" gives a failure almost
instantly when applied to a whole line of "a" characters, whereas the pattern
ending in \ed takes an appreciable time with strings longer than about 20
characters.
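.P
The idea behind this check can be sketched as a wrapper that looks for the
required literal before running the matcher at all. This is an illustration
in Python, not PCRE's implementation: PCRE works out the required character
itself, whereas here the caller passes it in, and the function name is
invented for the example.
.sp
```python
import re


def search_with_required_literal(pattern, required_literal, subject):
    """Search for pattern, but give up at once if a literal that every
    match must contain is absent from the subject."""
    # Cheap linear scan: no match is possible without the literal.
    if required_literal not in subject:
        return None
    # Only now pay for the full backtracking match.
    return re.search(pattern, subject)


if __name__ == "__main__":
    # Fails immediately: there is no "b", so the matcher never runs.
    print(search_with_required_literal(r"(a+)*b", "b", "a" * 40))
    # The literal is present, so the ordinary match goes ahead.
    print(search_with_required_literal(r"(a+)*b", "b", "aaab").group(0))
```
.sp
No such shortcut exists when the pattern ends in an escape such as a digit
class, because there is no single character to look for.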
.P
In many cases, the solution to this kind of performance issue is to use an
atomic group or a possessive quantifier.
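.P
As an illustration, the slow pattern from the previous example can be
rewritten both ways. The sketch below uses Python's re module rather than
PCRE, and it assumes Python 3.11 or later, the first version whose re module
accepts atomic groups and possessive quantifiers. The helper names are made
up. Because the repeated group cannot be re-entered after it has matched, a
subject consisting only of "a" characters is rejected quickly.
.sp
```python
import re
import sys


def compile_committed_patterns():
    """Compile an atomic-group and a possessive rewrite of the pattern
    (a+)* followed by a digit. Assumes Python 3.11 or later."""
    atomic = re.compile(r"(?>(?:a+)*)\d")    # atomic group
    possessive = re.compile(r"(?:a+)*+\d")   # possessive quantifier
    return atomic, possessive


def first_match(pattern, subject):
    """Return the text of the first match, or None if there is none."""
    m = pattern.search(subject)
    return None if m is None else m.group(0)


if __name__ == "__main__":
    if sys.version_info >= (3, 11):
        for pattern in compile_committed_patterns():
            # No digit: each start position fails without retrying splits.
            print(first_match(pattern, "a" * 40))
            # A digit after the run of a's still matches as before.
            print(first_match(pattern, "aaa7"))
```
.sp
Both rewrites change what the pattern can match in cases that rely on
backtracking into the repeat, so they are only suitable when that
backtracking is not needed.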
.
.
.SH AUTHOR
.rs
.sp
.nf
Philip Hazel
University Computing Service
Cambridge CB2 3QH, England.
.fi
.
.
.SH REVISION
.rs
.sp
.nf
Last updated: 09 January 2012
Copyright (c) 1997-2012 University of Cambridge.
.fi
