ViewVC logotype

Contents of /code/trunk/doc/pcrestack.3

Parent Directory Parent Directory | Revision Log Revision Log

Revision 1055 - (hide annotations) (download)
Tue Oct 16 15:53:30 2012 UTC (2 years, 6 months ago) by chpe
File size: 8901 byte(s)
pcre32: Add 32-bit library

Create libpcre32 that operates on 32-bit characters (UTF-32).

This turned out to be surprisingly simple after the UTF-16 support
was introduced; mostly just extra ifdefs and adjusting and adding
some tests.
1 chpe 1055 .TH PCRESTACK 3 "24 June 2012" "PCRE 8.30"
2 nigel 91 .SH NAME
3     PCRE - Perl-compatible regular expressions
5     .rs
6     .sp
7 chpe 1055 When you call \fBpcre[16|32]_exec()\fP, it makes use of an internal function
8 ph10 861 called \fBmatch()\fP. This calls itself recursively at branch points in the
9     pattern, in order to remember the state of the match so that it can back up and
10     try a different alternative if the first one fails. As matching proceeds deeper
11     and deeper into the tree of possibilities, the recursion depth increases. The
12 ph10 654 \fBmatch()\fP function is also called in other circumstances, for example,
13 ph10 631 whenever a parenthesized sub-pattern is entered, and in certain cases of
14     repetition.
15 nigel 91 .P
16     Not all calls of \fBmatch()\fP increase the recursion depth; for an item such
17     as a* it may be called several times at the same level, after matching
18     different numbers of a's. Furthermore, in a number of cases where the result of
19     the recursive call would immediately be passed back as the result of the
20     current call (a "tail recursion"), the function is just restarted instead.
21     .P
22 chpe 1055 The above comments apply when \fBpcre[16|32]_exec()\fP is run in its normal
23 ph10 691 interpretive manner. If the pattern was studied with the
24     PCRE_STUDY_JIT_COMPILE option, and just-in-time compiling was successful, and
25 chpe 1055 the options passed to \fBpcre[16|32]_exec()\fP were not incompatible, the matching
26 ph10 691 process uses the JIT-compiled code instead of the \fBmatch()\fP function. In
27 ph10 678 this case, the memory requirements are handled entirely differently. See the
28     .\" HREF
29     \fBpcrejit\fP
30     .\"
31     documentation for details.
32     .P
33 chpe 1055 The \fBpcre[16|32]_dfa_exec()\fP function operates in an entirely different way,
34 ph10 861 and uses recursion only when there is a regular expression recursion or
35     subroutine call in the pattern. This includes the processing of assertion and
36     "once-only" subpatterns, which are handled like subroutine calls. Normally,
37     these are never very deep, and the limit on the complexity of
38 chpe 1055 \fBpcre[16|32]_dfa_exec()\fP is controlled by the amount of workspace it is given.
39 ph10 861 However, it is possible to write patterns with runaway infinite recursions;
40 chpe 1055 such patterns will cause \fBpcre[16|32]_dfa_exec()\fP to run out of stack. At
41 ph10 861 present, there is no protection against this.
42 nigel 91 .P
43 chpe 1055 The comments that follow do NOT apply to \fBpcre[16|32]_dfa_exec()\fP; they are
44     relevant only for \fBpcre[16|32]_exec()\fP without the JIT optimization.
45 ph10 480 .
46     .
47 chpe 1055 .SS "Reducing \fBpcre[16|32]_exec()\fP's stack usage"
48 ph10 480 .rs
49     .sp
50 nigel 91 Each time that \fBmatch()\fP is actually called recursively, it uses memory
51     from the process stack. For certain kinds of pattern and data, very large
52     amounts of stack may be needed, despite the recognition of "tail recursion".
53     You can often reduce the amount of recursion, and therefore the amount of stack
54     used, by modifying the pattern that is being matched. Consider, for example,
55     this pattern:
56     .sp
57     ([^<]|<(?!inet))+
58     .sp
59     It matches from wherever it starts until it encounters "<inet" or the end of
60     the data, and is the kind of pattern that might be used when processing an XML
61     file. Each iteration of the outer parentheses matches either one character that
62     is not "<" or a "<" that is not followed by "inet". However, each time a
63     parenthesis is processed, a recursion occurs, so this formulation uses a stack
64     frame for each matched character. For a long string, a lot of stack is
65     required. Consider now this rewritten pattern, which matches exactly the same
66     strings:
67     .sp
68 ph10 122 ([^<]++|<(?!inet))+
69 nigel 91 .sp
70     This uses very much less stack, because runs of characters that do not contain
71     "<" are "swallowed" in one item inside the parentheses. Recursion happens only
72     when a "<" character that is not followed by "inet" is encountered (and we
73     assume this is relatively rare). A possessive quantifier is used to stop any
74     backtracking into the runs of non-"<" characters, but that is not related to
75     stack usage.
76     .P
77 nigel 93 This example shows that one way of avoiding stack problems when matching long
78     subject strings is to write repeated parenthesized subpatterns to match more
79     than one character whenever possible.
80 ph10 358 .
81 ph10 480 .
82 chpe 1055 .SS "Compiling PCRE to use heap instead of stack for \fBpcre[16|32]_exec()\fP"
83 ph10 358 .rs
84     .sp
85 nigel 91 In environments where stack memory is constrained, you might want to compile
86 ph10 480 PCRE to use heap memory instead of stack for remembering back-up points when
87 chpe 1055 \fBpcre[16|32]_exec()\fP is running. This makes it run a lot more slowly, however.
88 ph10 480 Details of how to do this are given in the
89 nigel 91 .\" HREF
90     \fBpcrebuild\fP
91     .\"
92 ph10 174 documentation. When built in this way, instead of using the stack, PCRE obtains
93     and frees memory by calling the functions that are pointed to by the
94 chpe 1055 \fBpcre[16|32]_stack_malloc\fP and \fBpcre[16|32]_stack_free\fP variables. By
95 ph10 861 default, these point to \fBmalloc()\fP and \fBfree()\fP, but you can replace
96     the pointers to cause PCRE to use your own functions. Since the block sizes are
97     always the same, and are always freed in reverse order, it may be possible to
98     implement customized memory handlers that are more efficient than the standard
99     functions.
100 ph10 358 .
101 ph10 480 .
102 chpe 1055 .SS "Limiting \fBpcre[16|32]_exec()\fP's stack usage"
103 ph10 358 .rs
104     .sp
105 ph10 480 You can set limits on the number of times that \fBmatch()\fP is called, both in
106 chpe 1055 total and recursively. If a limit is exceeded, \fBpcre[16|32]_exec()\fP returns an
107 ph10 480 error code. Setting suitable limits should prevent it from running out of
108     stack. The default values of the limits are very large, and unlikely ever to
109     operate. They can be changed when PCRE is built, and they can also be set when
110 chpe 1055 \fBpcre[16|32]_exec()\fP is called. For details of these interfaces, see the
111 ph10 358 .\" HREF
112     \fBpcrebuild\fP
113     .\"
114 ph10 480 documentation and the
115     .\" HTML <a href="pcreapi.html#extradata">
116     .\" </a>
117 chpe 1055 section on extra data for \fBpcre[16|32]_exec()\fP
118 ph10 480 .\"
119     in the
120 ph10 358 .\" HREF
121     \fBpcreapi\fP
122     .\"
123     documentation.
124 nigel 91 .P
125 ph10 358 As a very rough rule of thumb, you should reckon on about 500 bytes per
126 ph10 901 recursion. Thus, if you want to limit your stack usage to 8Mb, you should set
127     the limit at 16000 recursions. A 64Mb stack, on the other hand, can support
128     around 128000 recursions.
129 ph10 480 .P
130     In Unix-like environments, the \fBpcretest\fP test program has a command line
131     option (\fB-S\fP) that can be used to increase the size of its stack. As long
132     as the stack is large enough, another option (\fB-M\fP) can be used to find the
133     smallest limits that allow a particular pattern to match a given subject
134 chpe 1055 string. This is done by calling \fBpcre[16|32]_exec()\fP repeatedly with different
135 ph10 480 limits.
136 ph10 358 .
137 ph10 480 .
138 ph10 901 .SS "Obtaining an estimate of stack usage"
139     .rs
140     .sp
141     The actual amount of stack used per recursion can vary quite a lot, depending
142     on the compiler that was used to build PCRE and the optimization or debugging
143 ph10 903 options that were set for it. The rule of thumb value of 500 bytes mentioned
144     above may be larger or smaller than what is actually needed. A better
145 ph10 901 approximation can be obtained by running this command:
146     .sp
147     pcretest -m -C
148     .sp
149 ph10 903 The \fB-C\fP option causes \fBpcretest\fP to output information about the
150     options with which PCRE was compiled. When \fB-m\fP is also given (before
151 ph10 901 \fB-C\fP), information about stack use is given in a line like this:
152     .sp
153     Match recursion uses stack: approximate frame size = 640 bytes
154 ph10 903 .sp
155     The value is approximate because some recursions need a bit more (up to perhaps
156 ph10 901 16 more bytes).
157     .P
158 ph10 903 If the above command is given when PCRE is compiled to use the heap instead of
159     the stack for recursion, the value that is output is the size of each block
160     that is obtained from the heap.
161 ph10 901 .
162     .
163 ph10 358 .SS "Changing stack size in Unix-like systems"
164     .rs
165     .sp
166 nigel 93 In Unix-like environments, there is not often a problem with the stack unless
167     very long strings are involved, though the default limit on stack size varies
168     from system to system. Values from 8Mb to 64Mb are common. You can find your
169     default limit by running the command:
170 nigel 91 .sp
171     ulimit -s
172     .sp
173 nigel 93 Unfortunately, the effect of running out of stack is often SIGSEGV, though
174     sometimes a more explicit error message is given. You can normally increase the
175     limit on stack size by code such as this:
176 nigel 91 .sp
177     struct rlimit rlim;
178     getrlimit(RLIMIT_STACK, &rlim);
179     rlim.rlim_cur = 100*1024*1024;
180     setrlimit(RLIMIT_STACK, &rlim);
181     .sp
182     This reads the current limits (soft and hard) using \fBgetrlimit()\fP, then
183     attempts to increase the soft limit to 100Mb using \fBsetrlimit()\fP. You must
184 chpe 1055 do this before calling \fBpcre[16|32]_exec()\fP.
185 ph10 358 .
186 ph10 480 .
187 ph10 358 .SS "Changing stack size in Mac OS X"
188     .rs
189     .sp
190     Using \fBsetrlimit()\fP, as described above, should also work on Mac OS X. It
191     is also possible to set a stack size when linking a program. There is a
192     discussion about stack sizes in Mac OS X at this web site:
193     .\" HTML <a href="http://developer.apple.com/qa/qa2005/qa1419.html">
194     .\" </a>
195     http://developer.apple.com/qa/qa2005/qa1419.html.
196 nigel 91 .\"
197 ph10 99 .
198     .
199     .SH AUTHOR
200     .rs
201     .sp
202     .nf
203     Philip Hazel
204     University Computing Service
205     Cambridge CB2 3QH, England.
206     .fi
207     .
208     .
209     .SH REVISION
210     .rs
211     .sp
212     .nf
213 chpe 1055 Last updated: 24 June 2012
214 ph10 861 Copyright (c) 1997-2012 University of Cambridge.
215 ph10 99 .fi


Name Value
svn:eol-style native
svn:keywords "Author Date Id Revision Url"

ViewVC Help
Powered by ViewVC 1.1.12