/[pcre]/code/trunk/HACKING
ViewVC logotype

Contents of /code/trunk/HACKING

Parent Directory Parent Directory | Revision Log Revision Log


Revision 127 - (show annotations) (download)
Mon Mar 19 11:44:45 2007 UTC (7 years, 8 months ago) by ph10
File size: 17135 byte(s)
A number of tidies, file renames, etc.

1 Technical Notes about PCRE
2 --------------------------
3
4 These are very rough technical notes that record potentially useful information
5 about PCRE internals.
6
7 Historical note 1
8 -----------------
9
10 Many years ago I implemented some regular expression functions to an algorithm
11 suggested by Martin Richards. These were not Unix-like in form, and were quite
12 restricted in what they could do by comparison with Perl. The interesting part
13 about the algorithm was that the amount of space required to hold the compiled
14 form of an expression was known in advance. The code to apply an expression did
15 not operate by backtracking, as the original Henry Spencer code and current
16 Perl code does, but instead checked all possibilities simultaneously by keeping
17 a list of current states and checking all of them as it advanced through the
18 subject string. In the terminology of Jeffrey Friedl's book, it was a "DFA
19 algorithm", though it was not a traditional Finite State Machine (FSM). When
20 the pattern was all used up, all remaining states were possible matches, and
21 the one matching the longest subset of the subject string was chosen. This did
22 not necessarily maximize the individual wild portions of the pattern, as is
23 expected in Unix and Perl-style regular expressions.
24
25 Historical note 2
26 -----------------
27
28 By contrast, the code originally written by Henry Spencer (which was
29 subsequently heavily modified for Perl) compiles the expression twice: once in
30 a dummy mode in order to find out how much store will be needed, and then for
31 real. (The Perl version probably doesn't do this any more; I'm talking about
32 the original library.) The execution function operates by backtracking and
33 maximizing (or, optionally, minimizing in Perl) the amount of the subject that
34 matches individual wild portions of the pattern. This is an "NFA algorithm" in
35 Friedl's terminology.
36
37 OK, here's the real stuff
38 -------------------------
39
40 For the set of functions that form the "basic" PCRE library (which are
41 unrelated to those mentioned above), I tried at first to invent an algorithm
42 that used an amount of store bounded by a multiple of the number of characters
43 in the pattern, to save on compiling time. However, because of the greater
44 complexity in Perl regular expressions, I couldn't do this. In any case, a
45 first pass through the pattern is helpful for other reasons.
46
47 Computing the memory requirement: how it was
48 --------------------------------------------
49
50 Up to and including release 6.7, PCRE worked by running a very degenerate first
51 pass to calculate a maximum store size, and then a second pass to do the real
52 compile - which might use a bit less than the predicted amount of memory. The
53 idea was that this would turn out faster than the Henry Spencer code because
54 the first pass is degenerate and the second pass can just store stuff straight
55 into the vector, which it knows is big enough.
56
57 Computing the memory requirement: how it is
58 -------------------------------------------
59
60 By the time I was working on a potential 6.8 release, the degenerate first pass
61 had become very complicated and hard to maintain. Indeed one of the early
62 things I did for 6.8 was to fix Yet Another Bug in the memory computation. Then
63 I had a flash of inspiration as to how I could run the real compile function in
64 a "fake" mode that enables it to compute how much memory it would need, while
65 actually only ever using a few hundred bytes of working memory, and without too
66 many tests of the mode that might slow it down. So I re-factored the compiling
67 functions to work this way. This got rid of about 600 lines of source. It
68 should make future maintenance and development easier. As this was such a major
69 change, I never released 6.8, instead upping the number to 7.0 (other quite
70 major changes are also present in the 7.0 release).
71
72 A side effect of this work is that the previous limit of 200 on the nesting
73 depth of parentheses was removed. However, there is a downside: pcre_compile()
74 runs more slowly than before (30% or more, depending on the pattern) because it
75 is doing a full analysis of the pattern. My hope is that this is not a big
76 issue.
77
78 Traditional matching function
79 -----------------------------
80
81 The "traditional", and original, matching function is called pcre_exec(), and
82 it implements an NFA algorithm, similar to the original Henry Spencer algorithm
83 and the way that Perl works. Not surprising, since it is intended to be as
84 compatible with Perl as possible. This is the function most users of PCRE will
85 use most of the time.
86
87 Supplementary matching function
88 -------------------------------
89
90 From PCRE 6.0, there is also a supplementary matching function called
91 pcre_dfa_exec(). This implements a DFA matching algorithm that searches
92 simultaneously for all possible matches that start at one point in the subject
93 string. (Going back to my roots: see Historical Note 1 above.) This function
94 intreprets the same compiled pattern data as pcre_exec(); however, not all the
95 facilities are available, and those that are do not always work in quite the
96 same way. See the user documentation for details.
97
98 The algorithm that is used for pcre_dfa_exec() is not a traditional FSM,
99 because it may have a number of states active at one time. More work would be
100 needed at compile time to produce a traditional FSM where only one state is
101 ever active at once. I believe some other regex matchers work this way.
102
103
104 Format of compiled patterns
105 ---------------------------
106
107 The compiled form of a pattern is a vector of bytes, containing items of
108 variable length. The first byte in an item is an opcode, and the length of the
109 item is either implicit in the opcode or contained in the data bytes that
110 follow it.
111
112 In many cases below "two-byte" data values are specified. This is in fact just
113 a default when the number is an offset within the compiled pattern. PCRE can be
114 compiled to use 3-byte or 4-byte values for these offsets (impairing the
115 performance). This is necessary only when patterns whose compiled length is
116 greater than 64K are going to be processed. In this description, we assume the
117 "normal" compilation options. "Two-byte" data values that are counts (e.g. for
118 quantifiers) are always just two bytes.
119
120 A list of all the opcodes follows:
121
122 Opcodes with no following data
123 ------------------------------
124
125 These items are all just one byte long
126
127 OP_END end of pattern
128 OP_ANY match any character
129 OP_ANYBYTE match any single byte, even in UTF-8 mode
130 OP_SOD match start of data: \A
131 OP_SOM, start of match (subject + offset): \G
132 OP_CIRC ^ (start of data, or after \n in multiline)
133 OP_NOT_WORD_BOUNDARY \W
134 OP_WORD_BOUNDARY \w
135 OP_NOT_DIGIT \D
136 OP_DIGIT \d
137 OP_NOT_WHITESPACE \S
138 OP_WHITESPACE \s
139 OP_NOT_WORDCHAR \W
140 OP_WORDCHAR \w
141 OP_EODN match end of data or \n at end: \Z
142 OP_EOD match end of data: \z
143 OP_DOLL $ (end of data, or before \n in multiline)
144 OP_EXTUNI match an extended Unicode character
145 OP_ANYNL match any Unicode newline sequence
146
147
148 Repeating single characters
149 ---------------------------
150
151 The common repeats (*, +, ?) when applied to a single character use the
152 following opcodes:
153
154 OP_STAR
155 OP_MINSTAR
156 OP_POSSTAR
157 OP_PLUS
158 OP_MINPLUS
159 OP_POSPLUS
160 OP_QUERY
161 OP_MINQUERY
162 OP_POSQUERY
163
164 In ASCII mode, these are two-byte items; in UTF-8 mode, the length is variable.
165 Those with "MIN" in their name are the minimizing versions. Those with "POS" in
166 their names are possessive versions. Each is followed by the character that is
167 to be repeated. Other repeats make use of
168
169 OP_UPTO
170 OP_MINUPTO
171 OP_POSUPTO
172 OP_EXACT
173
174 which are followed by a two-byte count (most significant first) and the
175 repeated character. OP_UPTO matches from 0 to the given number. A repeat with a
176 non-zero minimum and a fixed maximum is coded as an OP_EXACT followed by an
177 OP_UPTO (or OP_MINUPTO or OPT_POSUPTO).
178
179
180 Repeating character types
181 -------------------------
182
183 Repeats of things like \d are done exactly as for single characters, except
184 that instead of a character, the opcode for the type is stored in the data
185 byte. The opcodes are:
186
187 OP_TYPESTAR
188 OP_TYPEMINSTAR
189 OP_TYPEPOSSTAR
190 OP_TYPEPLUS
191 OP_TYPEMINPLUS
192 OP_TYPEPOSPLUS
193 OP_TYPEQUERY
194 OP_TYPEMINQUERY
195 OP_TYPEPOSQUERY
196 OP_TYPEUPTO
197 OP_TYPEMINUPTO
198 OP_TYPEPOSUPTO
199 OP_TYPEEXACT
200
201
202 Match by Unicode property
203 -------------------------
204
205 OP_PROP and OP_NOTPROP are used for positive and negative matches of a
206 character by testing its Unicode property (the \p and \P escape sequences).
207 Each is followed by two bytes that encode the desired property as a type and a
208 value.
209
210 Repeats of these items use the OP_TYPESTAR etc. set of opcodes, followed by
211 three bytes: OP_PROP or OP_NOTPROP and then the desired property type and
212 value.
213
214
215 Matching literal characters
216 ---------------------------
217
218 The OP_CHAR opcode is followed by a single character that is to be matched
219 casefully. For caseless matching, OP_CHARNC is used. In UTF-8 mode, the
220 character may be more than one byte long. (Earlier versions of PCRE used
221 multi-character strings, but this was changed to allow some new features to be
222 added.)
223
224
225 Character classes
226 -----------------
227
228 If there is only one character, OP_CHAR or OP_CHARNC is used for a positive
229 class, and OP_NOT for a negative one (that is, for something like [^a]).
230 However, in UTF-8 mode, the use of OP_NOT applies only to characters with
231 values < 128, because OP_NOT is confined to single bytes.
232
233 Another set of repeating opcodes (OP_NOTSTAR etc.) are used for a repeated,
234 negated, single-character class. The normal ones (OP_STAR etc.) are used for a
235 repeated positive single-character class.
236
237 When there's more than one character in a class and all the characters are less
238 than 256, OP_CLASS is used for a positive class, and OP_NCLASS for a negative
239 one. In either case, the opcode is followed by a 32-byte bit map containing a 1
240 bit for every character that is acceptable. The bits are counted from the least
241 significant end of each byte.
242
243 The reason for having both OP_CLASS and OP_NCLASS is so that, in UTF-8 mode,
244 subject characters with values greater than 256 can be handled correctly. For
245 OP_CLASS they don't match, whereas for OP_NCLASS they do.
246
247 For classes containing characters with values > 255, OP_XCLASS is used. It
248 optionally uses a bit map (if any characters lie within it), followed by a list
249 of pairs and single characters. There is a flag character than indicates
250 whether it's a positive or a negative class.
251
252
253 Back references
254 ---------------
255
256 OP_REF is followed by two bytes containing the reference number.
257
258
259 Repeating character classes and back references
260 -----------------------------------------------
261
262 Single-character classes are handled specially (see above). This section
263 applies to OP_CLASS and OP_REF. In both cases, the repeat information follows
264 the base item. The matching code looks at the following opcode to see if it is
265 one of
266
267 OP_CRSTAR
268 OP_CRMINSTAR
269 OP_CRPLUS
270 OP_CRMINPLUS
271 OP_CRQUERY
272 OP_CRMINQUERY
273 OP_CRRANGE
274 OP_CRMINRANGE
275
276 All but the last two are just single-byte items. The others are followed by
277 four bytes of data, comprising the minimum and maximum repeat counts. There are
278 no special possessive opcodes for these repeats; a possessive repeat is
279 compiled into an atomic group.
280
281
282 Brackets and alternation
283 ------------------------
284
285 A pair of non-capturing (round) brackets is wrapped round each expression at
286 compile time, so alternation always happens in the context of brackets.
287
288 [Note for North Americans: "bracket" to some English speakers, including
289 myself, can be round, square, curly, or pointy. Hence this usage.]
290
291 Non-capturing brackets use the opcode OP_BRA. Originally PCRE was limited to 99
292 capturing brackets and it used a different opcode for each one. From release
293 3.5, the limit was removed by putting the bracket number into the data for
294 higher-numbered brackets. From release 7.0 all capturing brackets are handled
295 this way, using the single opcode OP_CBRA.
296
297 A bracket opcode is followed by LINK_SIZE bytes which give the offset to the
298 next alternative OP_ALT or, if there aren't any branches, to the matching
299 OP_KET opcode. Each OP_ALT is followed by LINK_SIZE bytes giving the offset to
300 the next one, or to the OP_KET opcode. For capturing brackets, the bracket
301 number immediately follows the offset, always as a 2-byte item.
302
303 OP_KET is used for subpatterns that do not repeat indefinitely, while
304 OP_KETRMIN and OP_KETRMAX are used for indefinite repetitions, minimally or
305 maximally respectively. All three are followed by LINK_SIZE bytes giving (as a
306 positive number) the offset back to the matching bracket opcode.
307
308 If a subpattern is quantified such that it is permitted to match zero times, it
309 is preceded by one of OP_BRAZERO or OP_BRAMINZERO. These are single-byte
310 opcodes which tell the matcher that skipping this subpattern entirely is a
311 valid branch.
312
313 A subpattern with an indefinite maximum repetition is replicated in the
314 compiled data its minimum number of times (or once with OP_BRAZERO if the
315 minimum is zero), with the final copy terminating with OP_KETRMIN or OP_KETRMAX
316 as appropriate.
317
318 A subpattern with a bounded maximum repetition is replicated in a nested
319 fashion up to the maximum number of times, with OP_BRAZERO or OP_BRAMINZERO
320 before each replication after the minimum, so that, for example, (abc){2,5} is
321 compiled as (abc)(abc)((abc)((abc)(abc)?)?)?, except that each bracketed group
322 has the same number.
323
324 When a repeated subpattern has an unbounded upper limit, it is checked to see
325 whether it could match an empty string. If this is the case, the opcode in the
326 final replication is changed to OP_SBRA or OP_SCBRA. This tells the matcher
327 that it needs to check for matching an empty string when it hits OP_KETRMIN or
328 OP_KETRMAX, and if so, to break the loop.
329
330
331 Assertions
332 ----------
333
334 Forward assertions are just like other subpatterns, but starting with one of
335 the opcodes OP_ASSERT or OP_ASSERT_NOT. Backward assertions use the opcodes
336 OP_ASSERTBACK and OP_ASSERTBACK_NOT, and the first opcode inside the assertion
337 is OP_REVERSE, followed by a two byte count of the number of characters to move
338 back the pointer in the subject string. When operating in UTF-8 mode, the count
339 is a character count rather than a byte count. A separate count is present in
340 each alternative of a lookbehind assertion, allowing them to have different
341 fixed lengths.
342
343
344 Once-only (atomic) subpatterns
345 ------------------------------
346
347 These are also just like other subpatterns, but they start with the opcode
348 OP_ONCE. The check for matching an empty string in an unbounded repeat is
349 handled entirely at runtime, so there is just this one opcode.
350
351
352 Conditional subpatterns
353 -----------------------
354
355 These are like other subpatterns, but they start with the opcode OP_COND, or
356 OP_SCOND for one that might match an empty string in an unbounded repeat. If
357 the condition is a back reference, this is stored at the start of the
358 subpattern using the opcode OP_CREF followed by two bytes containing the
359 reference number. If the condition is "in recursion" (coded as "(?(R)"), or "in
360 recursion of group x" (coded as "(?(Rx)"), the group number is stored at the
361 start of the subpattern using the opcode OP_RREF, and a value of zero for "the
362 whole pattern". For a DEFINE condition, just the single byte OP_DEF is used (it
363 has no associated data). Otherwise, a conditional subpattern always starts with
364 one of the assertions.
365
366
367 Recursion
368 ---------
369
370 Recursion either matches the current regex, or some subexpression. The opcode
371 OP_RECURSE is followed by an value which is the offset to the starting bracket
372 from the start of the whole pattern. From release 6.5, OP_RECURSE is
373 automatically wrapped inside OP_ONCE brackets (because otherwise some patterns
374 broke it). OP_RECURSE is also used for "subroutine" calls, even though they
375 are not strictly a recursion.
376
377
378 Callout
379 -------
380
381 OP_CALLOUT is followed by one byte of data that holds a callout number in the
382 range 0 to 254 for manual callouts, or 255 for an automatic callout. In both
383 cases there follows a two-byte value giving the offset in the pattern to the
384 start of the following item, and another two-byte item giving the length of the
385 next item.
386
387
388 Changing options
389 ----------------
390
391 If any of the /i, /m, or /s options are changed within a pattern, an OP_OPT
392 opcode is compiled, followed by one byte containing the new settings of these
393 flags. If there are several alternatives, there is an occurrence of OP_OPT at
394 the start of all those following the first options change, to set appropriate
395 options for the start of the alternative. Immediately after the end of the
396 group there is another such item to reset the flags to their previous values. A
397 change of flag right at the very start of the pattern can be handled entirely
398 at compile time, and so does not cause anything to be put into the compiled
399 data.
400
401 Philip Hazel
402 November 2006

webmaster@exim.org
ViewVC Help
Powered by ViewVC 1.1.12