/[pcre]/code/trunk/doc/Tech.Notes
ViewVC logotype

Diff of /code/trunk/doc/Tech.Notes

Parent Directory Parent Directory | Revision Log Revision Log | View Patch Patch

revision 76 by nigel, Sat Feb 24 21:40:37 2007 UTC revision 77 by nigel, Sat Feb 24 21:40:45 2007 UTC
# Line 32  terminology. Line 32  terminology.
32  OK, here's the real stuff  OK, here's the real stuff
33  -------------------------  -------------------------
34    
35  For the set of functions that forms PCRE (which are unrelated to those  For the set of functions that form the "basic" PCRE library (which are
36  mentioned above), I tried at first to invent an algorithm that used an amount  unrelated to those mentioned above), I tried at first to invent an algorithm
37  of store bounded by a multiple of the number of characters in the pattern, to  that used an amount of store bounded by a multiple of the number of characters
38  save on compiling time. However, because of the greater complexity in Perl  in the pattern, to save on compiling time. However, because of the greater
39  regular expressions, I couldn't do this. In any case, a first pass through the  complexity in Perl regular expressions, I couldn't do this. In any case, a
40  pattern is needed, for a number of reasons. PCRE works by running a very  first pass through the pattern is needed, for a number of reasons. PCRE works
41  degenerate first pass to calculate a maximum store size, and then a second pass  by running a very degenerate first pass to calculate a maximum store size, and
42  to do the real compile - which may use a bit less than the predicted amount of  then a second pass to do the real compile - which may use a bit less than the
43  store. The idea is that this is going to turn out faster because the first pass  predicted amount of store. The idea is that this is going to turn out faster
44  is degenerate and the second pass can just store stuff straight into the  because the first pass is degenerate and the second pass can just store stuff
45  vector. It does make the compiling functions bigger, of course, but they have  straight into the vector, which it knows is big enough. It does make the
46  got quite big anyway to handle all the Perl stuff.  compiling functions bigger, of course, but they have got quite big anyway to
47    handle all the Perl stuff.
48    
49    Traditional matching function
50    -----------------------------
51    
52    The "traditional", and original, matching function is called pcre_exec(), and
53    it implements an NFA algorithm, similar to the original Henry Spencer algorithm
54    and the way that Perl works. Not surprising, since it is intended to be as
55    compatible with Perl as possible. This is the function most users of PCRE will
56    use most of the time.
57    
58    Supplementary matching function
59    -------------------------------
60    
61    From PCRE 6.0, there is also a supplementary matching function called
62    pcre_dfa_exec(). This implements a DFA matching algorithm that searches
63    simultaneously for all possible matches that start at one point in the subject
64    string. (Going back to my roots: see Historical Note 1 above.) This function
65    intreprets the same compiled pattern data as pcre_exec(); however, not all the
66    facilities are available, and those that are don't always work in quite the
67    same way. See the user documentation for details.
68    
69    Format of compiled patterns
70    ---------------------------
71    
72  The compiled form of a pattern is a vector of bytes, containing items of  The compiled form of a pattern is a vector of bytes, containing items of
73  variable length. The first byte in an item is an opcode, and the length of the  variable length. The first byte in an item is an opcode, and the length of the
# Line 223  number. This opcode is ignored while mat Line 247  number. This opcode is ignored while mat
247  the bracket itself. (They could have all been done like this, but I was making  the bracket itself. (They could have all been done like this, but I was making
248  minimal changes.)  minimal changes.)
249    
250  A bracket opcode is followed by two bytes which give the offset to the next  A bracket opcode is followed by LINK_SIZE bytes which give the offset to the
251  alternative OP_ALT or, if there aren't any branches, to the matching OP_KET  next alternative OP_ALT or, if there aren't any branches, to the matching
252  opcode. Each OP_ALT is followed by two bytes giving the offset to the next one,  OP_KET opcode. Each OP_ALT is followed by LINK_SIZE bytes giving the offset to
253  or to the OP_KET opcode.  the next one, or to the OP_KET opcode.
254    
255  OP_KET is used for subpatterns that do not repeat indefinitely, while  OP_KET is used for subpatterns that do not repeat indefinitely, while
256  OP_KETRMIN and OP_KETRMAX are used for indefinite repetitions, minimally or  OP_KETRMIN and OP_KETRMAX are used for indefinite repetitions, minimally or
257  maximally respectively. All three are followed by two bytes giving (as a  maximally respectively. All three are followed by LINK_SIZE bytes giving (as a
258  positive number) the offset back to the matching OP_BRA opcode.  positive number) the offset back to the matching OP_BRA opcode.
259    
260  If a subpattern is quantified such that it is permitted to match zero times, it  If a subpattern is quantified such that it is permitted to match zero times, it
# Line 312  at compile time, and so does not cause a Line 336  at compile time, and so does not cause a
336  data.  data.
337    
338  Philip Hazel  Philip Hazel
339  September 2004  March 2005

Legend:
Removed from v.76  
changed lines
  Added in v.77

webmaster@exim.org
ViewVC Help
Powered by ViewVC 1.1.12