/[pcre]/code/trunk/HACKING
ViewVC logotype

Contents of /code/trunk/HACKING

Parent Directory Parent Directory | Revision Log Revision Log


Revision 212 - (show annotations) (download)
Thu Aug 9 11:16:34 2007 UTC (7 years, 3 months ago) by ph10
File size: 17529 byte(s)
Updating docs for release; fix heap-related bugs in pcre_exec shown up by 
release testing.

1 Technical Notes about PCRE
2 --------------------------
3
4 These are very rough technical notes that record potentially useful information
5 about PCRE internals.
6
7 Historical note 1
8 -----------------
9
10 Many years ago I implemented some regular expression functions to an algorithm
11 suggested by Martin Richards. These were not Unix-like in form, and were quite
12 restricted in what they could do by comparison with Perl. The interesting part
13 about the algorithm was that the amount of space required to hold the compiled
14 form of an expression was known in advance. The code to apply an expression did
15 not operate by backtracking, as the original Henry Spencer code and current
16 Perl code does, but instead checked all possibilities simultaneously by keeping
17 a list of current states and checking all of them as it advanced through the
18 subject string. In the terminology of Jeffrey Friedl's book, it was a "DFA
19 algorithm", though it was not a traditional Finite State Machine (FSM). When
20 the pattern was all used up, all remaining states were possible matches, and
21 the one matching the longest subset of the subject string was chosen. This did
22 not necessarily maximize the individual wild portions of the pattern, as is
23 expected in Unix and Perl-style regular expressions.
24
25 Historical note 2
26 -----------------
27
28 By contrast, the code originally written by Henry Spencer (which was
29 subsequently heavily modified for Perl) compiles the expression twice: once in
30 a dummy mode in order to find out how much store will be needed, and then for
31 real. (The Perl version probably doesn't do this any more; I'm talking about
32 the original library.) The execution function operates by backtracking and
33 maximizing (or, optionally, minimizing in Perl) the amount of the subject that
34 matches individual wild portions of the pattern. This is an "NFA algorithm" in
35 Friedl's terminology.
36
37 OK, here's the real stuff
38 -------------------------
39
40 For the set of functions that form the "basic" PCRE library (which are
41 unrelated to those mentioned above), I tried at first to invent an algorithm
42 that used an amount of store bounded by a multiple of the number of characters
43 in the pattern, to save on compiling time. However, because of the greater
44 complexity in Perl regular expressions, I couldn't do this. In any case, a
45 first pass through the pattern is helpful for other reasons.
46
47 Computing the memory requirement: how it was
48 --------------------------------------------
49
50 Up to and including release 6.7, PCRE worked by running a very degenerate first
51 pass to calculate a maximum store size, and then a second pass to do the real
52 compile - which might use a bit less than the predicted amount of memory. The
53 idea was that this would turn out faster than the Henry Spencer code because
54 the first pass is degenerate and the second pass can just store stuff straight
55 into the vector, which it knows is big enough.
56
57 Computing the memory requirement: how it is
58 -------------------------------------------
59
60 By the time I was working on a potential 6.8 release, the degenerate first pass
61 had become very complicated and hard to maintain. Indeed one of the early
62 things I did for 6.8 was to fix Yet Another Bug in the memory computation. Then
63 I had a flash of inspiration as to how I could run the real compile function in
64 a "fake" mode that enables it to compute how much memory it would need, while
65 actually only ever using a few hundred bytes of working memory, and without too
66 many tests of the mode that might slow it down. So I re-factored the compiling
67 functions to work this way. This got rid of about 600 lines of source. It
68 should make future maintenance and development easier. As this was such a major
69 change, I never released 6.8, instead upping the number to 7.0 (other quite
70 major changes are also present in the 7.0 release).
71
72 A side effect of this work is that the previous limit of 200 on the nesting
73 depth of parentheses was removed. However, there is a downside: pcre_compile()
74 runs more slowly than before (30% or more, depending on the pattern) because it
75 is doing a full analysis of the pattern. My hope is that this is not a big
76 issue.
77
78 Traditional matching function
79 -----------------------------
80
81 The "traditional", and original, matching function is called pcre_exec(), and
82 it implements an NFA algorithm, similar to the original Henry Spencer algorithm
83 and the way that Perl works. Not surprising, since it is intended to be as
84 compatible with Perl as possible. This is the function most users of PCRE will
85 use most of the time.
86
87 Supplementary matching function
88 -------------------------------
89
90 From PCRE 6.0, there is also a supplementary matching function called
91 pcre_dfa_exec(). This implements a DFA matching algorithm that searches
92 simultaneously for all possible matches that start at one point in the subject
93 string. (Going back to my roots: see Historical Note 1 above.) This function
94 intreprets the same compiled pattern data as pcre_exec(); however, not all the
95 facilities are available, and those that are do not always work in quite the
96 same way. See the user documentation for details.
97
98 The algorithm that is used for pcre_dfa_exec() is not a traditional FSM,
99 because it may have a number of states active at one time. More work would be
100 needed at compile time to produce a traditional FSM where only one state is
101 ever active at once. I believe some other regex matchers work this way.
102
103
104 Format of compiled patterns
105 ---------------------------
106
107 The compiled form of a pattern is a vector of bytes, containing items of
108 variable length. The first byte in an item is an opcode, and the length of the
109 item is either implicit in the opcode or contained in the data bytes that
110 follow it.
111
112 In many cases below LINK_SIZE data values are specified for offsets within the
113 compiled pattern. The default value for LINK_SIZE is 2, but PCRE can be
114 compiled to use 3-byte or 4-byte values for these offsets (impairing the
115 performance). This is necessary only when patterns whose compiled length is
116 greater than 64K are going to be processed. In this description, we assume the
117 "normal" compilation options. Data values that are counts (e.g. for
118 quantifiers) are always just two bytes long.
119
120 A list of the opcodes follows:
121
122 Opcodes with no following data
123 ------------------------------
124
125 These items are all just one byte long
126
127 OP_END end of pattern
128 OP_ANY match any character
129 OP_ANYBYTE match any single byte, even in UTF-8 mode
130 OP_SOD match start of data: \A
131 OP_SOM, start of match (subject + offset): \G
132 OP_SET_SOM, set start of match (\K)
133 OP_CIRC ^ (start of data, or after \n in multiline)
134 OP_NOT_WORD_BOUNDARY \W
135 OP_WORD_BOUNDARY \w
136 OP_NOT_DIGIT \D
137 OP_DIGIT \d
138 OP_NOT_HSPACE \H
139 OP_HSPACE \h
140 OP_NOT_WHITESPACE \S
141 OP_WHITESPACE \s
142 OP_NOT_VSPACE \V
143 OP_VSPACE \v
144 OP_NOT_WORDCHAR \W
145 OP_WORDCHAR \w
146 OP_EODN match end of data or \n at end: \Z
147 OP_EOD match end of data: \z
148 OP_DOLL $ (end of data, or before \n in multiline)
149 OP_EXTUNI match an extended Unicode character
150 OP_ANYNL match any Unicode newline sequence
151
152 OP_ACCEPT )
153 OP_COMMIT )
154 OP_FAIL ) These are Perl 5.10's "backtracking
155 OP_PRUNE ) control verbs".
156 OP_SKIP )
157 OP_THEN )
158
159
160 Repeating single characters
161 ---------------------------
162
163 The common repeats (*, +, ?) when applied to a single character use the
164 following opcodes:
165
166 OP_STAR
167 OP_MINSTAR
168 OP_POSSTAR
169 OP_PLUS
170 OP_MINPLUS
171 OP_POSPLUS
172 OP_QUERY
173 OP_MINQUERY
174 OP_POSQUERY
175
176 In ASCII mode, these are two-byte items; in UTF-8 mode, the length is variable.
177 Those with "MIN" in their name are the minimizing versions. Those with "POS" in
178 their names are possessive versions. Each is followed by the character that is
179 to be repeated. Other repeats make use of
180
181 OP_UPTO
182 OP_MINUPTO
183 OP_POSUPTO
184 OP_EXACT
185
186 which are followed by a two-byte count (most significant first) and the
187 repeated character. OP_UPTO matches from 0 to the given number. A repeat with a
188 non-zero minimum and a fixed maximum is coded as an OP_EXACT followed by an
189 OP_UPTO (or OP_MINUPTO or OPT_POSUPTO).
190
191
192 Repeating character types
193 -------------------------
194
195 Repeats of things like \d are done exactly as for single characters, except
196 that instead of a character, the opcode for the type is stored in the data
197 byte. The opcodes are:
198
199 OP_TYPESTAR
200 OP_TYPEMINSTAR
201 OP_TYPEPOSSTAR
202 OP_TYPEPLUS
203 OP_TYPEMINPLUS
204 OP_TYPEPOSPLUS
205 OP_TYPEQUERY
206 OP_TYPEMINQUERY
207 OP_TYPEPOSQUERY
208 OP_TYPEUPTO
209 OP_TYPEMINUPTO
210 OP_TYPEPOSUPTO
211 OP_TYPEEXACT
212
213
214 Match by Unicode property
215 -------------------------
216
217 OP_PROP and OP_NOTPROP are used for positive and negative matches of a
218 character by testing its Unicode property (the \p and \P escape sequences).
219 Each is followed by two bytes that encode the desired property as a type and a
220 value.
221
222 Repeats of these items use the OP_TYPESTAR etc. set of opcodes, followed by
223 three bytes: OP_PROP or OP_NOTPROP and then the desired property type and
224 value.
225
226
227 Matching literal characters
228 ---------------------------
229
230 The OP_CHAR opcode is followed by a single character that is to be matched
231 casefully. For caseless matching, OP_CHARNC is used. In UTF-8 mode, the
232 character may be more than one byte long. (Earlier versions of PCRE used
233 multi-character strings, but this was changed to allow some new features to be
234 added.)
235
236
237 Character classes
238 -----------------
239
240 If there is only one character, OP_CHAR or OP_CHARNC is used for a positive
241 class, and OP_NOT for a negative one (that is, for something like [^a]).
242 However, in UTF-8 mode, the use of OP_NOT applies only to characters with
243 values < 128, because OP_NOT is confined to single bytes.
244
245 Another set of repeating opcodes (OP_NOTSTAR etc.) are used for a repeated,
246 negated, single-character class. The normal ones (OP_STAR etc.) are used for a
247 repeated positive single-character class.
248
249 When there's more than one character in a class and all the characters are less
250 than 256, OP_CLASS is used for a positive class, and OP_NCLASS for a negative
251 one. In either case, the opcode is followed by a 32-byte bit map containing a 1
252 bit for every character that is acceptable. The bits are counted from the least
253 significant end of each byte.
254
255 The reason for having both OP_CLASS and OP_NCLASS is so that, in UTF-8 mode,
256 subject characters with values greater than 256 can be handled correctly. For
257 OP_CLASS they don't match, whereas for OP_NCLASS they do.
258
259 For classes containing characters with values > 255, OP_XCLASS is used. It
260 optionally uses a bit map (if any characters lie within it), followed by a list
261 of pairs and single characters. There is a flag character than indicates
262 whether it's a positive or a negative class.
263
264
265 Back references
266 ---------------
267
268 OP_REF is followed by two bytes containing the reference number.
269
270
271 Repeating character classes and back references
272 -----------------------------------------------
273
274 Single-character classes are handled specially (see above). This section
275 applies to OP_CLASS and OP_REF. In both cases, the repeat information follows
276 the base item. The matching code looks at the following opcode to see if it is
277 one of
278
279 OP_CRSTAR
280 OP_CRMINSTAR
281 OP_CRPLUS
282 OP_CRMINPLUS
283 OP_CRQUERY
284 OP_CRMINQUERY
285 OP_CRRANGE
286 OP_CRMINRANGE
287
288 All but the last two are just single-byte items. The others are followed by
289 four bytes of data, comprising the minimum and maximum repeat counts. There are
290 no special possessive opcodes for these repeats; a possessive repeat is
291 compiled into an atomic group.
292
293
294 Brackets and alternation
295 ------------------------
296
297 A pair of non-capturing (round) brackets is wrapped round each expression at
298 compile time, so alternation always happens in the context of brackets.
299
300 [Note for North Americans: "bracket" to some English speakers, including
301 myself, can be round, square, curly, or pointy. Hence this usage.]
302
303 Non-capturing brackets use the opcode OP_BRA. Originally PCRE was limited to 99
304 capturing brackets and it used a different opcode for each one. From release
305 3.5, the limit was removed by putting the bracket number into the data for
306 higher-numbered brackets. From release 7.0 all capturing brackets are handled
307 this way, using the single opcode OP_CBRA.
308
309 A bracket opcode is followed by LINK_SIZE bytes which give the offset to the
310 next alternative OP_ALT or, if there aren't any branches, to the matching
311 OP_KET opcode. Each OP_ALT is followed by LINK_SIZE bytes giving the offset to
312 the next one, or to the OP_KET opcode. For capturing brackets, the bracket
313 number immediately follows the offset, always as a 2-byte item.
314
315 OP_KET is used for subpatterns that do not repeat indefinitely, while
316 OP_KETRMIN and OP_KETRMAX are used for indefinite repetitions, minimally or
317 maximally respectively. All three are followed by LINK_SIZE bytes giving (as a
318 positive number) the offset back to the matching bracket opcode.
319
320 If a subpattern is quantified such that it is permitted to match zero times, it
321 is preceded by one of OP_BRAZERO or OP_BRAMINZERO. These are single-byte
322 opcodes which tell the matcher that skipping this subpattern entirely is a
323 valid branch.
324
325 A subpattern with an indefinite maximum repetition is replicated in the
326 compiled data its minimum number of times (or once with OP_BRAZERO if the
327 minimum is zero), with the final copy terminating with OP_KETRMIN or OP_KETRMAX
328 as appropriate.
329
330 A subpattern with a bounded maximum repetition is replicated in a nested
331 fashion up to the maximum number of times, with OP_BRAZERO or OP_BRAMINZERO
332 before each replication after the minimum, so that, for example, (abc){2,5} is
333 compiled as (abc)(abc)((abc)((abc)(abc)?)?)?, except that each bracketed group
334 has the same number.
335
336 When a repeated subpattern has an unbounded upper limit, it is checked to see
337 whether it could match an empty string. If this is the case, the opcode in the
338 final replication is changed to OP_SBRA or OP_SCBRA. This tells the matcher
339 that it needs to check for matching an empty string when it hits OP_KETRMIN or
340 OP_KETRMAX, and if so, to break the loop.
341
342
343 Assertions
344 ----------
345
346 Forward assertions are just like other subpatterns, but starting with one of
347 the opcodes OP_ASSERT or OP_ASSERT_NOT. Backward assertions use the opcodes
348 OP_ASSERTBACK and OP_ASSERTBACK_NOT, and the first opcode inside the assertion
349 is OP_REVERSE, followed by a two byte count of the number of characters to move
350 back the pointer in the subject string. When operating in UTF-8 mode, the count
351 is a character count rather than a byte count. A separate count is present in
352 each alternative of a lookbehind assertion, allowing them to have different
353 fixed lengths.
354
355
356 Once-only (atomic) subpatterns
357 ------------------------------
358
359 These are also just like other subpatterns, but they start with the opcode
360 OP_ONCE. The check for matching an empty string in an unbounded repeat is
361 handled entirely at runtime, so there is just this one opcode.
362
363
364 Conditional subpatterns
365 -----------------------
366
367 These are like other subpatterns, but they start with the opcode OP_COND, or
368 OP_SCOND for one that might match an empty string in an unbounded repeat. If
369 the condition is a back reference, this is stored at the start of the
370 subpattern using the opcode OP_CREF followed by two bytes containing the
371 reference number. If the condition is "in recursion" (coded as "(?(R)"), or "in
372 recursion of group x" (coded as "(?(Rx)"), the group number is stored at the
373 start of the subpattern using the opcode OP_RREF, and a value of zero for "the
374 whole pattern". For a DEFINE condition, just the single byte OP_DEF is used (it
375 has no associated data). Otherwise, a conditional subpattern always starts with
376 one of the assertions.
377
378
379 Recursion
380 ---------
381
382 Recursion either matches the current regex, or some subexpression. The opcode
383 OP_RECURSE is followed by an value which is the offset to the starting bracket
384 from the start of the whole pattern. From release 6.5, OP_RECURSE is
385 automatically wrapped inside OP_ONCE brackets (because otherwise some patterns
386 broke it). OP_RECURSE is also used for "subroutine" calls, even though they
387 are not strictly a recursion.
388
389
390 Callout
391 -------
392
393 OP_CALLOUT is followed by one byte of data that holds a callout number in the
394 range 0 to 254 for manual callouts, or 255 for an automatic callout. In both
395 cases there follows a two-byte value giving the offset in the pattern to the
396 start of the following item, and another two-byte item giving the length of the
397 next item.
398
399
400 Changing options
401 ----------------
402
403 If any of the /i, /m, or /s options are changed within a pattern, an OP_OPT
404 opcode is compiled, followed by one byte containing the new settings of these
405 flags. If there are several alternatives, there is an occurrence of OP_OPT at
406 the start of all those following the first options change, to set appropriate
407 options for the start of the alternative. Immediately after the end of the
408 group there is another such item to reset the flags to their previous values. A
409 change of flag right at the very start of the pattern can be handled entirely
410 at compile time, and so does not cause anything to be put into the compiled
411 data.
412
413 Philip Hazel
414 August 2007

webmaster@exim.org
ViewVC Help
Powered by ViewVC 1.1.12