| 23 |
individual wild portions of the pattern. This is an "NFA algorithm" in Friedl's |
individual wild portions of the pattern. This is an "NFA algorithm" in Friedl's |
| 24 |
terminology. |
terminology. |
| 25 |
|
|
| 26 |
For this set of functions that forms PCRE, I tried at first to invent an |
For the set of functions that forms PCRE (which are unrelated to those |
| 27 |
algorithm that used an amount of store bounded by a multiple of the number of |
mentioned above), I tried at first to invent an algorithm that used an amount |
| 28 |
characters in the pattern, to save on compiling time. However, because of the |
of store bounded by a multiple of the number of characters in the pattern, to |
| 29 |
greater complexity in Perl regular expressions, I couldn't do this. In any |
save on compiling time. However, because of the greater complexity in Perl |
| 30 |
case, a first pass through the pattern is needed, in order to find internal |
regular expressions, I couldn't do this. In any case, a first pass through the |
| 31 |
flag settings like (?i) at top level. So it works by running a very degenerate |
pattern is needed, in order to find internal flag settings like (?i) at top |
| 32 |
first pass to calculate a maximum store size, and then a second pass to do the |
level. So PCRE works by running a very degenerate first pass to calculate a |
| 33 |
real compile - which may use a bit less than the predicted amount of store. The |
maximum store size, and then a second pass to do the real compile - which may |
| 34 |
idea is that this is going to turn out faster because the first pass is |
use a bit less than the predicted amount of store. The idea is that this is |
| 35 |
degenerate and the second can just store stuff straight into the vector. It |
going to turn out faster because the first pass is degenerate and the second |
| 36 |
does make the compiling functions bigger, of course, but they have got quite |
pass can just store stuff straight into the vector. It does make the compiling |
| 37 |
big anyway to handle all the Perl stuff. |
functions bigger, of course, but they have got quite big anyway to handle all |
| 38 |
|
the Perl stuff. |
| 39 |
|
|
| 40 |
The compiled form of a pattern is a vector of bytes, containing items of |
The compiled form of a pattern is a vector of bytes, containing items of |
| 41 |
variable length. The first byte in an item is an opcode, and the length of the |
variable length. The first byte in an item is an opcode, and the length of the |
| 62 |
OP_EODN match end of data or \n at end: \Z |
OP_EODN match end of data or \n at end: \Z |
| 63 |
OP_EOD match end of data: \z |
OP_EOD match end of data: \z |
| 64 |
OP_DOLL $ (end of data, or before \n in multiline) |
OP_DOLL $ (end of data, or before \n in multiline) |
| 65 |
|
OP_RECURSE match the pattern recursively |
| 66 |
|
|
| 67 |
|
|
| 68 |
Repeating single characters |
Repeating single characters |
| 127 |
repeated, negated, single-character class. The normal ones (OP_STAR etc.) are |
repeated, negated, single-character class. The normal ones (OP_STAR etc.) are |
| 128 |
used for a repeated positive single-character class. |
used for a repeated positive single-character class. |
| 129 |
|
|
| 130 |
OP_CLASS is followed by a 32-byte bit map containing a 1 |
OP_CLASS is followed by a 32-byte bit map containing a 1 bit for every |
| 131 |
bit for every character that is acceptable. The bits are counted from the least |
character that is acceptable. The bits are counted from the least significant |
| 132 |
significant end of each byte. |
end of each byte. |
| 133 |
|
|
| 134 |
|
|
| 135 |
Back references |
Back references |
| 161 |
Brackets and alternation |
Brackets and alternation |
| 162 |
------------------------ |
------------------------ |
| 163 |
|
|
| 164 |
A pair of non-identifying (round) brackets is wrapped round each expression at |
A pair of non-capturing (round) brackets is wrapped round each expression at |
| 165 |
compile time, so alternation always happens in the context of brackets. |
compile time, so alternation always happens in the context of brackets. |
| 166 |
Non-identifying brackets use the opcode OP_BRA, while identifying brackets use |
Non-capturing brackets use the opcode OP_BRA, while capturing brackets use |
| 167 |
OP_BRA+1, OP_BRA+2, etc. [Note for North Americans: "bracket" to some English |
OP_BRA+1, OP_BRA+2, etc. [Note for North Americans: "bracket" to some English |
| 168 |
speakers, including myself, can be round, square, or curly. Hence this usage.] |
speakers, including myself, can be round, square, curly, or pointy. Hence this |
| 169 |
|
usage.] |
| 170 |
|
|
| 171 |
A bracket opcode is followed by two bytes which give the offset to the next |
A bracket opcode is followed by two bytes which give the offset to the next |
| 172 |
alternative OP_ALT or, if there aren't any branches, to the matching KET |
alternative OP_ALT or, if there aren't any branches, to the matching KET |
| 239 |
|
|
| 240 |
|
|
| 241 |
Philip Hazel |
Philip Hazel |
| 242 |
January 1999 |
February 2000 |