/[pcre]/code/trunk/HACKING
ViewVC logotype

Contents of /code/trunk/HACKING

Parent Directory Parent Directory | Revision Log Revision Log


Revision 181 - (show annotations) (download)
Wed Jun 13 14:55:18 2007 UTC (7 years, 2 months ago) by ph10
File size: 17297 byte(s)
Documentation update preparatory to release.

1 Technical Notes about PCRE
2 --------------------------
3
4 These are very rough technical notes that record potentially useful information
5 about PCRE internals.
6
7 Historical note 1
8 -----------------
9
10 Many years ago I implemented some regular expression functions to an algorithm
11 suggested by Martin Richards. These were not Unix-like in form, and were quite
12 restricted in what they could do by comparison with Perl. The interesting part
13 about the algorithm was that the amount of space required to hold the compiled
14 form of an expression was known in advance. The code to apply an expression did
15 not operate by backtracking, as the original Henry Spencer code and current
16 Perl code does, but instead checked all possibilities simultaneously by keeping
17 a list of current states and checking all of them as it advanced through the
18 subject string. In the terminology of Jeffrey Friedl's book, it was a "DFA
19 algorithm", though it was not a traditional Finite State Machine (FSM). When
20 the pattern was all used up, all remaining states were possible matches, and
21 the one matching the longest subset of the subject string was chosen. This did
22 not necessarily maximize the individual wild portions of the pattern, as is
23 expected in Unix and Perl-style regular expressions.
24
25 Historical note 2
26 -----------------
27
28 By contrast, the code originally written by Henry Spencer (which was
29 subsequently heavily modified for Perl) compiles the expression twice: once in
30 a dummy mode in order to find out how much store will be needed, and then for
31 real. (The Perl version probably doesn't do this any more; I'm talking about
32 the original library.) The execution function operates by backtracking and
33 maximizing (or, optionally, minimizing in Perl) the amount of the subject that
34 matches individual wild portions of the pattern. This is an "NFA algorithm" in
35 Friedl's terminology.
36
37 OK, here's the real stuff
38 -------------------------
39
40 For the set of functions that form the "basic" PCRE library (which are
41 unrelated to those mentioned above), I tried at first to invent an algorithm
42 that used an amount of store bounded by a multiple of the number of characters
43 in the pattern, to save on compiling time. However, because of the greater
44 complexity in Perl regular expressions, I couldn't do this. In any case, a
45 first pass through the pattern is helpful for other reasons.
46
47 Computing the memory requirement: how it was
48 --------------------------------------------
49
50 Up to and including release 6.7, PCRE worked by running a very degenerate first
51 pass to calculate a maximum store size, and then a second pass to do the real
52 compile - which might use a bit less than the predicted amount of memory. The
53 idea was that this would turn out faster than the Henry Spencer code because
54 the first pass is degenerate and the second pass can just store stuff straight
55 into the vector, which it knows is big enough.
56
57 Computing the memory requirement: how it is
58 -------------------------------------------
59
60 By the time I was working on a potential 6.8 release, the degenerate first pass
61 had become very complicated and hard to maintain. Indeed one of the early
62 things I did for 6.8 was to fix Yet Another Bug in the memory computation. Then
63 I had a flash of inspiration as to how I could run the real compile function in
64 a "fake" mode that enables it to compute how much memory it would need, while
65 actually only ever using a few hundred bytes of working memory, and without too
66 many tests of the mode that might slow it down. So I re-factored the compiling
67 functions to work this way. This got rid of about 600 lines of source. It
68 should make future maintenance and development easier. As this was such a major
69 change, I never released 6.8, instead upping the number to 7.0 (other quite
70 major changes are also present in the 7.0 release).
71
72 A side effect of this work is that the previous limit of 200 on the nesting
73 depth of parentheses was removed. However, there is a downside: pcre_compile()
74 runs more slowly than before (30% or more, depending on the pattern) because it
75 is doing a full analysis of the pattern. My hope is that this is not a big
76 issue.
77
78 Traditional matching function
79 -----------------------------
80
81 The "traditional", and original, matching function is called pcre_exec(), and
82 it implements an NFA algorithm, similar to the original Henry Spencer algorithm
83 and the way that Perl works. Not surprising, since it is intended to be as
84 compatible with Perl as possible. This is the function most users of PCRE will
85 use most of the time.
86
87 Supplementary matching function
88 -------------------------------
89
90 From PCRE 6.0, there is also a supplementary matching function called
91 pcre_dfa_exec(). This implements a DFA matching algorithm that searches
92 simultaneously for all possible matches that start at one point in the subject
93 string. (Going back to my roots: see Historical Note 1 above.) This function
94 intreprets the same compiled pattern data as pcre_exec(); however, not all the
95 facilities are available, and those that are do not always work in quite the
96 same way. See the user documentation for details.
97
98 The algorithm that is used for pcre_dfa_exec() is not a traditional FSM,
99 because it may have a number of states active at one time. More work would be
100 needed at compile time to produce a traditional FSM where only one state is
101 ever active at once. I believe some other regex matchers work this way.
102
103
104 Format of compiled patterns
105 ---------------------------
106
107 The compiled form of a pattern is a vector of bytes, containing items of
108 variable length. The first byte in an item is an opcode, and the length of the
109 item is either implicit in the opcode or contained in the data bytes that
110 follow it.
111
112 In many cases below "two-byte" data values are specified. This is in fact just
113 a default when the number is an offset within the compiled pattern. PCRE can be
114 compiled to use 3-byte or 4-byte values for these offsets (impairing the
115 performance). This is necessary only when patterns whose compiled length is
116 greater than 64K are going to be processed. In this description, we assume the
117 "normal" compilation options. "Two-byte" data values that are counts (e.g. for
118 quantifiers) are always just two bytes.
119
120 A list of all the opcodes follows:
121
122 Opcodes with no following data
123 ------------------------------
124
125 These items are all just one byte long
126
127 OP_END end of pattern
128 OP_ANY match any character
129 OP_ANYBYTE match any single byte, even in UTF-8 mode
130 OP_SOD match start of data: \A
131 OP_SOM, start of match (subject + offset): \G
132 OP_SET_SOM, set start of match (\K)
133 OP_CIRC ^ (start of data, or after \n in multiline)
134 OP_NOT_WORD_BOUNDARY \W
135 OP_WORD_BOUNDARY \w
136 OP_NOT_DIGIT \D
137 OP_DIGIT \d
138 OP_NOT_HSPACE \H
139 OP_HSPACE \h
140 OP_NOT_WHITESPACE \S
141 OP_WHITESPACE \s
142 OP_NOT_VSPACE \V
143 OP_VSPACE \v
144 OP_NOT_WORDCHAR \W
145 OP_WORDCHAR \w
146 OP_EODN match end of data or \n at end: \Z
147 OP_EOD match end of data: \z
148 OP_DOLL $ (end of data, or before \n in multiline)
149 OP_EXTUNI match an extended Unicode character
150 OP_ANYNL match any Unicode newline sequence
151
152
153 Repeating single characters
154 ---------------------------
155
156 The common repeats (*, +, ?) when applied to a single character use the
157 following opcodes:
158
159 OP_STAR
160 OP_MINSTAR
161 OP_POSSTAR
162 OP_PLUS
163 OP_MINPLUS
164 OP_POSPLUS
165 OP_QUERY
166 OP_MINQUERY
167 OP_POSQUERY
168
169 In ASCII mode, these are two-byte items; in UTF-8 mode, the length is variable.
170 Those with "MIN" in their name are the minimizing versions. Those with "POS" in
171 their names are possessive versions. Each is followed by the character that is
172 to be repeated. Other repeats make use of
173
174 OP_UPTO
175 OP_MINUPTO
176 OP_POSUPTO
177 OP_EXACT
178
179 which are followed by a two-byte count (most significant first) and the
180 repeated character. OP_UPTO matches from 0 to the given number. A repeat with a
181 non-zero minimum and a fixed maximum is coded as an OP_EXACT followed by an
182 OP_UPTO (or OP_MINUPTO or OPT_POSUPTO).
183
184
185 Repeating character types
186 -------------------------
187
188 Repeats of things like \d are done exactly as for single characters, except
189 that instead of a character, the opcode for the type is stored in the data
190 byte. The opcodes are:
191
192 OP_TYPESTAR
193 OP_TYPEMINSTAR
194 OP_TYPEPOSSTAR
195 OP_TYPEPLUS
196 OP_TYPEMINPLUS
197 OP_TYPEPOSPLUS
198 OP_TYPEQUERY
199 OP_TYPEMINQUERY
200 OP_TYPEPOSQUERY
201 OP_TYPEUPTO
202 OP_TYPEMINUPTO
203 OP_TYPEPOSUPTO
204 OP_TYPEEXACT
205
206
207 Match by Unicode property
208 -------------------------
209
210 OP_PROP and OP_NOTPROP are used for positive and negative matches of a
211 character by testing its Unicode property (the \p and \P escape sequences).
212 Each is followed by two bytes that encode the desired property as a type and a
213 value.
214
215 Repeats of these items use the OP_TYPESTAR etc. set of opcodes, followed by
216 three bytes: OP_PROP or OP_NOTPROP and then the desired property type and
217 value.
218
219
220 Matching literal characters
221 ---------------------------
222
223 The OP_CHAR opcode is followed by a single character that is to be matched
224 casefully. For caseless matching, OP_CHARNC is used. In UTF-8 mode, the
225 character may be more than one byte long. (Earlier versions of PCRE used
226 multi-character strings, but this was changed to allow some new features to be
227 added.)
228
229
230 Character classes
231 -----------------
232
233 If there is only one character, OP_CHAR or OP_CHARNC is used for a positive
234 class, and OP_NOT for a negative one (that is, for something like [^a]).
235 However, in UTF-8 mode, the use of OP_NOT applies only to characters with
236 values < 128, because OP_NOT is confined to single bytes.
237
238 Another set of repeating opcodes (OP_NOTSTAR etc.) are used for a repeated,
239 negated, single-character class. The normal ones (OP_STAR etc.) are used for a
240 repeated positive single-character class.
241
242 When there's more than one character in a class and all the characters are less
243 than 256, OP_CLASS is used for a positive class, and OP_NCLASS for a negative
244 one. In either case, the opcode is followed by a 32-byte bit map containing a 1
245 bit for every character that is acceptable. The bits are counted from the least
246 significant end of each byte.
247
248 The reason for having both OP_CLASS and OP_NCLASS is so that, in UTF-8 mode,
249 subject characters with values greater than 256 can be handled correctly. For
250 OP_CLASS they don't match, whereas for OP_NCLASS they do.
251
252 For classes containing characters with values > 255, OP_XCLASS is used. It
253 optionally uses a bit map (if any characters lie within it), followed by a list
254 of pairs and single characters. There is a flag character than indicates
255 whether it's a positive or a negative class.
256
257
258 Back references
259 ---------------
260
261 OP_REF is followed by two bytes containing the reference number.
262
263
264 Repeating character classes and back references
265 -----------------------------------------------
266
267 Single-character classes are handled specially (see above). This section
268 applies to OP_CLASS and OP_REF. In both cases, the repeat information follows
269 the base item. The matching code looks at the following opcode to see if it is
270 one of
271
272 OP_CRSTAR
273 OP_CRMINSTAR
274 OP_CRPLUS
275 OP_CRMINPLUS
276 OP_CRQUERY
277 OP_CRMINQUERY
278 OP_CRRANGE
279 OP_CRMINRANGE
280
281 All but the last two are just single-byte items. The others are followed by
282 four bytes of data, comprising the minimum and maximum repeat counts. There are
283 no special possessive opcodes for these repeats; a possessive repeat is
284 compiled into an atomic group.
285
286
287 Brackets and alternation
288 ------------------------
289
290 A pair of non-capturing (round) brackets is wrapped round each expression at
291 compile time, so alternation always happens in the context of brackets.
292
293 [Note for North Americans: "bracket" to some English speakers, including
294 myself, can be round, square, curly, or pointy. Hence this usage.]
295
296 Non-capturing brackets use the opcode OP_BRA. Originally PCRE was limited to 99
297 capturing brackets and it used a different opcode for each one. From release
298 3.5, the limit was removed by putting the bracket number into the data for
299 higher-numbered brackets. From release 7.0 all capturing brackets are handled
300 this way, using the single opcode OP_CBRA.
301
302 A bracket opcode is followed by LINK_SIZE bytes which give the offset to the
303 next alternative OP_ALT or, if there aren't any branches, to the matching
304 OP_KET opcode. Each OP_ALT is followed by LINK_SIZE bytes giving the offset to
305 the next one, or to the OP_KET opcode. For capturing brackets, the bracket
306 number immediately follows the offset, always as a 2-byte item.
307
308 OP_KET is used for subpatterns that do not repeat indefinitely, while
309 OP_KETRMIN and OP_KETRMAX are used for indefinite repetitions, minimally or
310 maximally respectively. All three are followed by LINK_SIZE bytes giving (as a
311 positive number) the offset back to the matching bracket opcode.
312
313 If a subpattern is quantified such that it is permitted to match zero times, it
314 is preceded by one of OP_BRAZERO or OP_BRAMINZERO. These are single-byte
315 opcodes which tell the matcher that skipping this subpattern entirely is a
316 valid branch.
317
318 A subpattern with an indefinite maximum repetition is replicated in the
319 compiled data its minimum number of times (or once with OP_BRAZERO if the
320 minimum is zero), with the final copy terminating with OP_KETRMIN or OP_KETRMAX
321 as appropriate.
322
323 A subpattern with a bounded maximum repetition is replicated in a nested
324 fashion up to the maximum number of times, with OP_BRAZERO or OP_BRAMINZERO
325 before each replication after the minimum, so that, for example, (abc){2,5} is
326 compiled as (abc)(abc)((abc)((abc)(abc)?)?)?, except that each bracketed group
327 has the same number.
328
329 When a repeated subpattern has an unbounded upper limit, it is checked to see
330 whether it could match an empty string. If this is the case, the opcode in the
331 final replication is changed to OP_SBRA or OP_SCBRA. This tells the matcher
332 that it needs to check for matching an empty string when it hits OP_KETRMIN or
333 OP_KETRMAX, and if so, to break the loop.
334
335
336 Assertions
337 ----------
338
339 Forward assertions are just like other subpatterns, but starting with one of
340 the opcodes OP_ASSERT or OP_ASSERT_NOT. Backward assertions use the opcodes
341 OP_ASSERTBACK and OP_ASSERTBACK_NOT, and the first opcode inside the assertion
342 is OP_REVERSE, followed by a two byte count of the number of characters to move
343 back the pointer in the subject string. When operating in UTF-8 mode, the count
344 is a character count rather than a byte count. A separate count is present in
345 each alternative of a lookbehind assertion, allowing them to have different
346 fixed lengths.
347
348
349 Once-only (atomic) subpatterns
350 ------------------------------
351
352 These are also just like other subpatterns, but they start with the opcode
353 OP_ONCE. The check for matching an empty string in an unbounded repeat is
354 handled entirely at runtime, so there is just this one opcode.
355
356
357 Conditional subpatterns
358 -----------------------
359
360 These are like other subpatterns, but they start with the opcode OP_COND, or
361 OP_SCOND for one that might match an empty string in an unbounded repeat. If
362 the condition is a back reference, this is stored at the start of the
363 subpattern using the opcode OP_CREF followed by two bytes containing the
364 reference number. If the condition is "in recursion" (coded as "(?(R)"), or "in
365 recursion of group x" (coded as "(?(Rx)"), the group number is stored at the
366 start of the subpattern using the opcode OP_RREF, and a value of zero for "the
367 whole pattern". For a DEFINE condition, just the single byte OP_DEF is used (it
368 has no associated data). Otherwise, a conditional subpattern always starts with
369 one of the assertions.
370
371
372 Recursion
373 ---------
374
375 Recursion either matches the current regex, or some subexpression. The opcode
376 OP_RECURSE is followed by an value which is the offset to the starting bracket
377 from the start of the whole pattern. From release 6.5, OP_RECURSE is
378 automatically wrapped inside OP_ONCE brackets (because otherwise some patterns
379 broke it). OP_RECURSE is also used for "subroutine" calls, even though they
380 are not strictly a recursion.
381
382
383 Callout
384 -------
385
386 OP_CALLOUT is followed by one byte of data that holds a callout number in the
387 range 0 to 254 for manual callouts, or 255 for an automatic callout. In both
388 cases there follows a two-byte value giving the offset in the pattern to the
389 start of the following item, and another two-byte item giving the length of the
390 next item.
391
392
393 Changing options
394 ----------------
395
396 If any of the /i, /m, or /s options are changed within a pattern, an OP_OPT
397 opcode is compiled, followed by one byte containing the new settings of these
398 flags. If there are several alternatives, there is an occurrence of OP_OPT at
399 the start of all those following the first options change, to set appropriate
400 options for the start of the alternative. Immediately after the end of the
401 group there is another such item to reset the flags to their previous values. A
402 change of flag right at the very start of the pattern can be handled entirely
403 at compile time, and so does not cause anything to be put into the compiled
404 data.
405
406 Philip Hazel
407 June 2007

webmaster@exim.org
ViewVC Help
Powered by ViewVC 1.1.12