/[pcre]/code/trunk/Tech.Notes
ViewVC logotype

Diff of /code/trunk/Tech.Notes

Parent Directory Parent Directory | Revision Log Revision Log | View Patch Patch

revision 3 by nigel, Sat Feb 24 21:38:01 2007 UTC revision 27 by nigel, Sat Feb 24 21:38:49 2007 UTC
# Line 18  expected in Unix and Perl-style regular Line 18  expected in Unix and Perl-style regular
18  By contrast, the code originally written by Henry Spencer and subsequently  By contrast, the code originally written by Henry Spencer and subsequently
19  heavily modified for Perl actually compiles the expression twice: once in a  heavily modified for Perl actually compiles the expression twice: once in a
20  dummy mode in order to find out how much store will be needed, and then for  dummy mode in order to find out how much store will be needed, and then for
21  real. The execution function operates by backtracking and maximizing (or  real. The execution function operates by backtracking and maximizing (or,
22  minimizing in Perl) the amount of the subject that matches individual wild  optionally, minimizing in Perl) the amount of the subject that matches
23  portions of the pattern. This is a "NFA algorithm".  individual wild portions of the pattern. This is an "NFA algorithm" in Friedl's
24    terminology.
25  For this set of functions, I tried at first to invent an algorithm that used an  
26  amount of store bounded by a multiple of the number of characters in the  For this set of functions that forms PCRE, I tried at first to invent an
27  pattern, to save on compiling time. However, because of the greater complexity  algorithm that used an amount of store bounded by a multiple of the number of
28  in Perl regular expressions, I couldn't do this. In any case, a first pass  characters in the pattern, to save on compiling time. However, because of the
29  through the pattern is needed, in order to find internal flag settings like  greater complexity in Perl regular expressions, I couldn't do this. In any
30  (?i). So it works by running a very degenerate first pass to calculate a  case, a first pass through the pattern is needed, in order to find internal
31  maximum store size, and then a second pass to do the real compile - which may  flag settings like (?i) at top level. So it works by running a very degenerate
32  use a bit less than the predicted amount of store. The idea is that this is  first pass to calculate a maximum store size, and then a second pass to do the
33  going to turn out faster because the first pass is degenerate and the second  real compile - which may use a bit less than the predicted amount of store. The
34  can just store stuff straight into the vector. It does make the compiling  idea is that this is going to turn out faster because the first pass is
35  functions bigger, of course, but they have got quite big anyway to handle all  degenerate and the second can just store stuff straight into the vector. It
36  the Perl stuff.  does make the compiling functions bigger, of course, but they have got quite
37    big anyway to handle all the Perl stuff.
38    
39  The compiled form of a pattern is a vector of bytes, containing items of  The compiled form of a pattern is a vector of bytes, containing items of
40  variable length. The first byte in an item is an opcode, and the length of the  variable length. The first byte in an item is an opcode, and the length of the
# Line 57  These items are all just one byte long Line 58  These items are all just one byte long
58    OP_WHITESPACE          \s    OP_WHITESPACE          \s
59    OP_NOT_WORDCHAR        \W    OP_NOT_WORDCHAR        \W
60    OP_WORDCHAR            \w    OP_WORDCHAR            \w
61    OP_CUT                 analogue of Prolog's "cut"    OP_EODN                match end of data or \n at end: \Z
62    OP_EOD                 match end of data: \Z    OP_EOD                 match end of data: \z
63    OP_DOLL                $ (end of data, or before \n in multiline)    OP_DOLL                $ (end of data, or before \n in multiline)
64    
65    
# Line 117  instances of OP_CHARS are used. Line 118  instances of OP_CHARS are used.
118  Character classes  Character classes
119  -----------------  -----------------
120    
121  OP_CLASS is used for a character class. It is followed by a 32-byte bit map  OP_CLASS is used for a character class, provided there are at least two
122  containing a 1 bit for every character that is acceptable. The bits are counted  characters in the class. If there is only one character, OP_CHARS is used for a
123  from the least significant end of each byte.  positive class, and OP_NOT for a negative one (that is, for something like
124    [^a]). Another set of repeating opcodes (OP_NOTSTAR etc.) are used for a
125    repeated, negated, single-character class. The normal ones (OP_STAR etc.) are
126    used for a repeated positive single-character class.
127    
128    OP_CLASS is followed by a 32-byte bit map containing a 1
129    bit for every character that is acceptable. The bits are counted from the least
130    significant end of each byte.
131    
132    
133  Back references  Back references
# Line 131  OP_REF is followed by a single byte cont Line 139  OP_REF is followed by a single byte cont
139  Repeating character classes and back references  Repeating character classes and back references
140  -----------------------------------------------  -----------------------------------------------
141    
142  In both cases, the repeat information follows the base item. The matching code  Single-character classes are handled specially (see above). This applies to
143  looks at the following opcode to see if it is one of  OP_CLASS and OP_REF. In both cases, the repeat information follows the base
144    item. The matching code looks at the following opcode to see if it is one of
145    
146    OP_CRSTAR    OP_CRSTAR
147    OP_CRMINSTAR    OP_CRMINSTAR
# Line 184  minimum. In effect, (abc){2,5} becomes ( Line 193  minimum. In effect, (abc){2,5} becomes (
193  Assertions  Assertions
194  ----------  ----------
195    
196  Assertions are just like other subpatterns, but starting with one of the  Forward assertions are just like other subpatterns, but starting with one of
197  opcodes OP_ASSERT or OP_ASSERT_NOT.  the opcodes OP_ASSERT or OP_ASSERT_NOT. Backward assertions use the opcodes
198    OP_ASSERTBACK and OP_ASSERTBACK_NOT, and the first opcode inside the assertion
199    is OP_REVERSE, followed by a two byte count of the number of characters to move
200    back the pointer in the subject string. A separate count is present in each
201    alternative of a lookbehind assertion, allowing them to have different fixed
202    lengths.
203    
204    
205  Once-only subpatterns  Once-only subpatterns
# Line 195  These are also just like other subpatter Line 209  These are also just like other subpatter
209  OP_ONCE.  OP_ONCE.
210    
211    
212    Conditional subpatterns
213    -----------------------
214    
215    These are like other subpatterns, but they start with the opcode OP_COND. If
216    the condition is a back reference, this is stored at the start of the
217    subpattern using the opcode OP_CREF followed by one byte containing the
218    reference number. Otherwise, a conditional subpattern will always start with
219    one of the assertions.
220    
221    
222    Changing options
223    ----------------
224    
225    If any of the /i, /m, or /s options are changed within a parenthesized group,
226    an OP_OPT opcode is compiled, followed by one byte containing the new settings
227    of these flags. If there are several alternatives in a group, there is an
228    occurrence of OP_OPT at the start of all those following the first options
229    change, to set appropriate options for the start of the alternative.
230    Immediately after the end of the group there is another such item to reset the
231    flags to their previous values. Other changes of flag within the pattern can be
232    handled entirely at compile time, and so do not cause anything to be put into
233    the compiled data.
234    
235    
236  Philip Hazel  Philip Hazel
237  October 1997  January 1999

Legend:
Removed from v.3  
changed lines
  Added in v.27

webmaster@exim.org
ViewVC Help
Powered by ViewVC 1.1.12