| 32 |
OK, here's the real stuff |
OK, here's the real stuff |
| 33 |
------------------------- |
------------------------- |
| 34 |
|
|
| 35 |
For the set of functions that forms PCRE (which are unrelated to those |
For the set of functions that form the "basic" PCRE library (which are |
| 36 |
mentioned above), I tried at first to invent an algorithm that used an amount |
unrelated to those mentioned above), I tried at first to invent an algorithm |
| 37 |
of store bounded by a multiple of the number of characters in the pattern, to |
that used an amount of store bounded by a multiple of the number of characters |
| 38 |
save on compiling time. However, because of the greater complexity in Perl |
in the pattern, to save on compiling time. However, because of the greater |
| 39 |
regular expressions, I couldn't do this. In any case, a first pass through the |
complexity in Perl regular expressions, I couldn't do this. In any case, a |
| 40 |
pattern is needed, for a number of reasons. PCRE works by running a very |
first pass through the pattern is needed, for a number of reasons. PCRE works |
| 41 |
degenerate first pass to calculate a maximum store size, and then a second pass |
by running a very degenerate first pass to calculate a maximum store size, and |
| 42 |
to do the real compile - which may use a bit less than the predicted amount of |
then a second pass to do the real compile - which may use a bit less than the |
| 43 |
store. The idea is that this is going to turn out faster because the first pass |
predicted amount of store. The idea is that this is going to turn out faster |
| 44 |
is degenerate and the second pass can just store stuff straight into the |
because the first pass is degenerate and the second pass can just store stuff |
| 45 |
vector. It does make the compiling functions bigger, of course, but they have |
straight into the vector, which it knows is big enough. It does make the |
| 46 |
got quite big anyway to handle all the Perl stuff. |
compiling functions bigger, of course, but they have got quite big anyway to |
| 47 |
|
handle all the Perl stuff. |
| 48 |
|
|
| 49 |
|
Traditional matching function |
| 50 |
|
----------------------------- |
| 51 |
|
|
| 52 |
|
The "traditional", and original, matching function is called pcre_exec(), and |
| 53 |
|
it implements an NFA algorithm, similar to the original Henry Spencer algorithm |
| 54 |
|
and the way that Perl works. Not surprising, since it is intended to be as |
| 55 |
|
compatible with Perl as possible. This is the function most users of PCRE will |
| 56 |
|
use most of the time. |
| 57 |
|
|
| 58 |
|
Supplementary matching function |
| 59 |
|
------------------------------- |
| 60 |
|
|
| 61 |
|
From PCRE 6.0, there is also a supplementary matching function called |
| 62 |
|
pcre_dfa_exec(). This implements a DFA matching algorithm that searches |
| 63 |
|
simultaneously for all possible matches that start at one point in the subject |
| 64 |
|
string. (Going back to my roots: see Historical Note 1 above.) This function |
| 65 |
|
intreprets the same compiled pattern data as pcre_exec(); however, not all the |
| 66 |
|
facilities are available, and those that are don't always work in quite the |
| 67 |
|
same way. See the user documentation for details. |
| 68 |
|
|
| 69 |
|
Format of compiled patterns |
| 70 |
|
--------------------------- |
| 71 |
|
|
| 72 |
The compiled form of a pattern is a vector of bytes, containing items of |
The compiled form of a pattern is a vector of bytes, containing items of |
| 73 |
variable length. The first byte in an item is an opcode, and the length of the |
variable length. The first byte in an item is an opcode, and the length of the |
| 247 |
the bracket itself. (They could have all been done like this, but I was making |
the bracket itself. (They could have all been done like this, but I was making |
| 248 |
minimal changes.) |
minimal changes.) |
| 249 |
|
|
| 250 |
A bracket opcode is followed by two bytes which give the offset to the next |
A bracket opcode is followed by LINK_SIZE bytes which give the offset to the |
| 251 |
alternative OP_ALT or, if there aren't any branches, to the matching OP_KET |
next alternative OP_ALT or, if there aren't any branches, to the matching |
| 252 |
opcode. Each OP_ALT is followed by two bytes giving the offset to the next one, |
OP_KET opcode. Each OP_ALT is followed by LINK_SIZE bytes giving the offset to |
| 253 |
or to the OP_KET opcode. |
the next one, or to the OP_KET opcode. |
| 254 |
|
|
| 255 |
OP_KET is used for subpatterns that do not repeat indefinitely, while |
OP_KET is used for subpatterns that do not repeat indefinitely, while |
| 256 |
OP_KETRMIN and OP_KETRMAX are used for indefinite repetitions, minimally or |
OP_KETRMIN and OP_KETRMAX are used for indefinite repetitions, minimally or |
| 257 |
maximally respectively. All three are followed by two bytes giving (as a |
maximally respectively. All three are followed by LINK_SIZE bytes giving (as a |
| 258 |
positive number) the offset back to the matching OP_BRA opcode. |
positive number) the offset back to the matching OP_BRA opcode. |
| 259 |
|
|
| 260 |
If a subpattern is quantified such that it is permitted to match zero times, it |
If a subpattern is quantified such that it is permitted to match zero times, it |
| 336 |
data. |
data. |
| 337 |
|
|
| 338 |
Philip Hazel |
Philip Hazel |
| 339 |
September 2004 |
March 2005 |