/[pcre]/code/tags/pcre-2.02/Tech.Notes
ViewVC logotype

Diff of /code/tags/pcre-2.02/Tech.Notes

Parent Directory Parent Directory | Revision Log Revision Log | View Patch Patch

code/trunk/Tech.Notes revision 23 by nigel, Sat Feb 24 21:38:41 2007 UTC code/tags/pcre-2.02/Tech.Notes revision 28 by nigel, Sat Feb 24 21:38:51 2007 UTC
# Line 23  optionally, minimizing in Perl) the amou Line 23  optionally, minimizing in Perl) the amou
23  individual wild portions of the pattern. This is an "NFA algorithm" in Friedl's  individual wild portions of the pattern. This is an "NFA algorithm" in Friedl's
24  terminology.  terminology.
25    
26  For this set of functions, I tried at first to invent an algorithm that used an  For this set of functions that forms PCRE, I tried at first to invent an
27  amount of store bounded by a multiple of the number of characters in the  algorithm that used an amount of store bounded by a multiple of the number of
28  pattern, to save on compiling time. However, because of the greater complexity  characters in the pattern, to save on compiling time. However, because of the
29  in Perl regular expressions, I couldn't do this. In any case, a first pass  greater complexity in Perl regular expressions, I couldn't do this. In any
30  through the pattern is needed, in order to find internal flag settings like  case, a first pass through the pattern is needed, in order to find internal
31  (?i) at top level. So it works by running a very degenerate first pass to  flag settings like (?i) at top level. So it works by running a very degenerate
32  calculate a maximum store size, and then a second pass to do the real compile -  first pass to calculate a maximum store size, and then a second pass to do the
33  which may use a bit less than the predicted amount of store. The idea is that  real compile - which may use a bit less than the predicted amount of store. The
34  this is going to turn out faster because the first pass is degenerate and the  idea is that this is going to turn out faster because the first pass is
35  second can just store stuff straight into the vector. It does make the  degenerate and the second can just store stuff straight into the vector. It
36  compiling functions bigger, of course, but they have got quite big anyway to  does make the compiling functions bigger, of course, but they have got quite
37  handle all the Perl stuff.  big anyway to handle all the Perl stuff.
38    
39  The compiled form of a pattern is a vector of bytes, containing items of  The compiled form of a pattern is a vector of bytes, containing items of
40  variable length. The first byte in an item is an opcode, and the length of the  variable length. The first byte in an item is an opcode, and the length of the
# Line 118  instances of OP_CHARS are used. Line 118  instances of OP_CHARS are used.
118  Character classes  Character classes
119  -----------------  -----------------
120    
121  OP_CLASS is used for a character class, and OP_NEGCLASS for a negated character  OP_CLASS is used for a character class, provided there are at least two
122  class, provided there are at least two characters in the class. If there is  characters in the class. If there is only one character, OP_CHARS is used for a
123  only one character, OP_CHARS is used for a positive class, and OP_NOT for a  positive class, and OP_NOT for a negative one (that is, for something like
124  negative one. A set of repeating opcodes (OP_NOTSTAR etc.) are used for a  [^a]). Another set of repeating opcodes (OP_NOTSTAR etc.) are used for a
125  repeated, negated, single-character class.  repeated, negated, single-character class. The normal ones (OP_STAR etc.) are
126    used for a repeated positive single-character class.
127    
128  Both OP_CLASS and OP_NEGCLASS are followed by a 32-byte bit map containing a 1  OP_CLASS is followed by a 32-byte bit map containing a 1
129  bit for every character that is acceptable. The bits are counted from the least  bit for every character that is acceptable. The bits are counted from the least
130  significant end of each byte. The reason for having two opcodes is to cope with  significant end of each byte.
 negated character classes when caseless matching is specified at run time but  
 not at compile time. If it is specified at compile time, the bit map is built  
 appropriately. This is the only time that a distinction is made between  
 OP_CLASS and OP_NEGCLASS, when the bit map was built in a caseful manner but  
 matching must be caseless. For OP_CLASS, a character matches if either of its  
 cases is in the bit map, but for OP_NEGCLASS, both of them must be present.  
131    
132    
133  Back references  Back references
# Line 144  OP_REF is followed by a single byte cont Line 139  OP_REF is followed by a single byte cont
139  Repeating character classes and back references  Repeating character classes and back references
140  -----------------------------------------------  -----------------------------------------------
141    
142  In both cases, the repeat information follows the base item. The matching code  Single-character classes are handled specially (see above). This applies to
143  looks at the following opcode to see if it is one of  OP_CLASS and OP_REF. In both cases, the repeat information follows the base
144    item. The matching code looks at the following opcode to see if it is one of
145    
146    OP_CRSTAR    OP_CRSTAR
147    OP_CRMINSTAR    OP_CRMINSTAR
# Line 201  Forward assertions are just like other s Line 197  Forward assertions are just like other s
197  the opcodes OP_ASSERT or OP_ASSERT_NOT. Backward assertions use the opcodes  the opcodes OP_ASSERT or OP_ASSERT_NOT. Backward assertions use the opcodes
198  OP_ASSERTBACK and OP_ASSERTBACK_NOT, and the first opcode inside the assertion  OP_ASSERTBACK and OP_ASSERTBACK_NOT, and the first opcode inside the assertion
199  is OP_REVERSE, followed by a two byte count of the number of characters to move  is OP_REVERSE, followed by a two byte count of the number of characters to move
200  back. A separate count is present in each alternative of a lookbehind  back the pointer in the subject string. A separate count is present in each
201  assertion, allowing them to have different fixed lengths.  alternative of a lookbehind assertion, allowing them to have different fixed
202    lengths.
203    
204    
205  Once-only subpatterns  Once-only subpatterns
# Line 237  the compiled data. Line 234  the compiled data.
234    
235    
236  Philip Hazel  Philip Hazel
237  September 1998  January 1999

Legend:
Removed from v.23  
changed lines
  Added in v.28

webmaster@exim.org
ViewVC Help
Powered by ViewVC 1.1.12