| 18 |
By contrast, the code originally written by Henry Spencer and subsequently |
By contrast, the code originally written by Henry Spencer and subsequently |
| 19 |
heavily modified for Perl actually compiles the expression twice: once in a |
heavily modified for Perl actually compiles the expression twice: once in a |
| 20 |
dummy mode in order to find out how much store will be needed, and then for |
dummy mode in order to find out how much store will be needed, and then for |
| 21 |
real. The execution function operates by backtracking and maximizing (or |
real. The execution function operates by backtracking and maximizing (or, |
| 22 |
minimizing in Perl) the amount of the subject that matches individual wild |
optionally, minimizing in Perl) the amount of the subject that matches |
| 23 |
portions of the pattern. This is a "NFA algorithm". |
individual wild portions of the pattern. This is an "NFA algorithm" in Friedl's |
| 24 |
|
terminology. |
| 25 |
|
|
| 26 |
For this set of functions, I tried at first to invent an algorithm that used an |
For this set of functions, I tried at first to invent an algorithm that used an |
| 27 |
amount of store bounded by a multiple of the number of characters in the |
amount of store bounded by a multiple of the number of characters in the |
| 28 |
pattern, to save on compiling time. However, because of the greater complexity |
pattern, to save on compiling time. However, because of the greater complexity |
| 29 |
in Perl regular expressions, I couldn't do this. In any case, a first pass |
in Perl regular expressions, I couldn't do this. In any case, a first pass |
| 30 |
through the pattern is needed, in order to find internal flag settings like |
through the pattern is needed, in order to find internal flag settings like |
| 31 |
(?i). So it works by running a very degenerate first pass to calculate a |
(?i) at top level. So it works by running a very degenerate first pass to |
| 32 |
maximum store size, and then a second pass to do the real compile - which may |
calculate a maximum store size, and then a second pass to do the real compile - |
| 33 |
use a bit less than the predicted amount of store. The idea is that this is |
which may use a bit less than the predicted amount of store. The idea is that |
| 34 |
going to turn out faster because the first pass is degenerate and the second |
this is going to turn out faster because the first pass is degenerate and the |
| 35 |
can just store stuff straight into the vector. It does make the compiling |
second can just store stuff straight into the vector. It does make the |
| 36 |
functions bigger, of course, but they have got quite big anyway to handle all |
compiling functions bigger, of course, but they have got quite big anyway to |
| 37 |
the Perl stuff. |
handle all the Perl stuff. |
| 38 |
|
|
| 39 |
The compiled form of a pattern is a vector of bytes, containing items of |
The compiled form of a pattern is a vector of bytes, containing items of |
| 40 |
variable length. The first byte in an item is an opcode, and the length of the |
variable length. The first byte in an item is an opcode, and the length of the |
| 58 |
OP_WHITESPACE \s |
OP_WHITESPACE \s |
| 59 |
OP_NOT_WORDCHAR \W |
OP_NOT_WORDCHAR \W |
| 60 |
OP_WORDCHAR \w |
OP_WORDCHAR \w |
| 61 |
OP_CUT analogue of Prolog's "cut" |
OP_EODN match end of data or \n at end: \Z |
| 62 |
OP_EOD match end of data: \Z |
OP_EOD match end of data: \z |
| 63 |
OP_DOLL $ (end of data, or before \n in multiline) |
OP_DOLL $ (end of data, or before \n in multiline) |
| 64 |
|
|
| 65 |
|
|
| 197 |
Assertions |
Assertions |
| 198 |
---------- |
---------- |
| 199 |
|
|
| 200 |
Assertions are just like other subpatterns, but starting with one of the |
Forward assertions are just like other subpatterns, but starting with one of |
| 201 |
opcodes OP_ASSERT or OP_ASSERT_NOT. |
the opcodes OP_ASSERT or OP_ASSERT_NOT. Backward assertions use the opcodes |
| 202 |
|
OP_ASSERTBACK and OP_ASSERTBACK_NOT, and the first opcode inside the assertion |
| 203 |
|
is OP_REVERSE, followed by a two byte count of the number of characters to move |
| 204 |
|
back. A separate count is present in each alternative of a lookbehind |
| 205 |
|
assertion, allowing them to have different fixed lengths. |
| 206 |
|
|
| 207 |
|
|
| 208 |
Once-only subpatterns |
Once-only subpatterns |
| 212 |
OP_ONCE. |
OP_ONCE. |
| 213 |
|
|
| 214 |
|
|
| 215 |
|
Conditional subpatterns |
| 216 |
|
----------------------- |
| 217 |
|
|
| 218 |
|
These are like other subpatterns, but they start with the opcode OP_COND. If |
| 219 |
|
the condition is a back reference, this is stored at the start of the |
| 220 |
|
subpattern using the opcode OP_CREF followed by one byte containing the |
| 221 |
|
reference number. Otherwise, a conditional subpattern will always start with |
| 222 |
|
one of the assertions. |
| 223 |
|
|
| 224 |
|
|
| 225 |
|
Changing options |
| 226 |
|
---------------- |
| 227 |
|
|
| 228 |
|
If any of the /i, /m, or /s options are changed within a parenthesized group, |
| 229 |
|
an OP_OPT opcode is compiled, followed by one byte containing the new settings |
| 230 |
|
of these flags. If there are several alternatives in a group, there is an |
| 231 |
|
occurrence of OP_OPT at the start of all those following the first options |
| 232 |
|
change, to set appropriate options for the start of the alternative. |
| 233 |
|
Immediately after the end of the group there is another such item to reset the |
| 234 |
|
flags to their previous values. Other changes of flag within the pattern can be |
| 235 |
|
handled entirely at compile time, and so do not cause anything to be put into |
| 236 |
|
the compiled data. |
| 237 |
|
|
| 238 |
|
|
| 239 |
Philip Hazel |
Philip Hazel |
| 240 |
December 1997 |
September 1998 |