| 18 |
By contrast, the code originally written by Henry Spencer and subsequently |
By contrast, the code originally written by Henry Spencer and subsequently |
| 19 |
heavily modified for Perl actually compiles the expression twice: once in a |
heavily modified for Perl actually compiles the expression twice: once in a |
| 20 |
dummy mode in order to find out how much store will be needed, and then for |
dummy mode in order to find out how much store will be needed, and then for |
| 21 |
real. The execution function operates by backtracking and maximizing (or |
real. The execution function operates by backtracking and maximizing (or, |
| 22 |
minimizing in Perl) the amount of the subject that matches individual wild |
optionally, minimizing in Perl) the amount of the subject that matches |
| 23 |
portions of the pattern. This is a "NFA algorithm". |
individual wild portions of the pattern. This is an "NFA algorithm" in Friedl's |
| 24 |
|
terminology. |
| 25 |
For this set of functions, I tried at first to invent an algorithm that used an |
|
| 26 |
amount of store bounded by a multiple of the number of characters in the |
For this set of functions that forms PCRE, I tried at first to invent an |
| 27 |
pattern, to save on compiling time. However, because of the greater complexity |
algorithm that used an amount of store bounded by a multiple of the number of |
| 28 |
in Perl regular expressions, I couldn't do this. In any case, a first pass |
characters in the pattern, to save on compiling time. However, because of the |
| 29 |
through the pattern is needed, in order to find internal flag settings like |
greater complexity in Perl regular expressions, I couldn't do this. In any |
| 30 |
(?i). So it works by running a very degenerate first pass to calculate a |
case, a first pass through the pattern is needed, in order to find internal |
| 31 |
maximum store size, and then a second pass to do the real compile - which may |
flag settings like (?i) at top level. So it works by running a very degenerate |
| 32 |
use a bit less than the predicted amount of store. The idea is that this is |
first pass to calculate a maximum store size, and then a second pass to do the |
| 33 |
going to turn out faster because the first pass is degenerate and the second |
real compile - which may use a bit less than the predicted amount of store. The |
| 34 |
can just store stuff straight into the vector. It does make the compiling |
idea is that this is going to turn out faster because the first pass is |
| 35 |
functions bigger, of course, but they have got quite big anyway to handle all |
degenerate and the second can just store stuff straight into the vector. It |
| 36 |
the Perl stuff. |
does make the compiling functions bigger, of course, but they have got quite |
| 37 |
|
big anyway to handle all the Perl stuff. |
| 38 |
|
|
| 39 |
The compiled form of a pattern is a vector of bytes, containing items of |
The compiled form of a pattern is a vector of bytes, containing items of |
| 40 |
variable length. The first byte in an item is an opcode, and the length of the |
variable length. The first byte in an item is an opcode, and the length of the |
| 58 |
OP_WHITESPACE \s |
OP_WHITESPACE \s |
| 59 |
OP_NOT_WORDCHAR \W |
OP_NOT_WORDCHAR \W |
| 60 |
OP_WORDCHAR \w |
OP_WORDCHAR \w |
| 61 |
OP_CUT analogue of Prolog's "cut" |
OP_EODN match end of data or \n at end: \Z |
| 62 |
OP_EOD match end of data: \Z |
OP_EOD match end of data: \z |
| 63 |
OP_DOLL $ (end of data, or before \n in multiline) |
OP_DOLL $ (end of data, or before \n in multiline) |
| 64 |
|
|
| 65 |
|
|
| 118 |
Character classes |
Character classes |
| 119 |
----------------- |
----------------- |
| 120 |
|
|
| 121 |
OP_CLASS is used for a character class, and OP_NEGCLASS for a negated character |
OP_CLASS is used for a character class, provided there are at least two |
| 122 |
class, provided there are at least two characters in the class. If there is |
characters in the class. If there is only one character, OP_CHARS is used for a |
| 123 |
only one character, OP_CHARS is used for a positive class, and OP_NOT for a |
positive class, and OP_NOT for a negative one (that is, for something like |
| 124 |
negative one. A set of repeating opcodes (OP_NOTSTAR etc.) are used for a |
[^a]). Another set of repeating opcodes (OP_NOTSTAR etc.) are used for a |
| 125 |
repeated, negated, single-character class. |
repeated, negated, single-character class. The normal ones (OP_STAR etc.) are |
| 126 |
|
used for a repeated positive single-character class. |
| 127 |
|
|
| 128 |
Both OP_CLASS and OP_NEGCLASS are followed by a 32-byte bit map containing a 1 |
OP_CLASS is followed by a 32-byte bit map containing a 1 |
| 129 |
bit for every character that is acceptable. The bits are counted from the least |
bit for every character that is acceptable. The bits are counted from the least |
| 130 |
significant end of each byte. The reason for having two opcodes is to cope with |
significant end of each byte. |
|
negated character classes when caseless matching is specified at run time but |
|
|
not at compile time. If it is specified at compile time, the bit map is built |
|
|
appropriately. This is the only time that a distinction is made between |
|
|
OP_CLASS and OP_NEGCLASS, when the bit map was built in a caseful manner but |
|
|
matching must be caseless. For OP_CLASS, a character matches if either of its |
|
|
cases is in the bit map, but for OP_NEGCLASS, both of them must be present. |
|
| 131 |
|
|
| 132 |
|
|
| 133 |
Back references |
Back references |
| 139 |
Repeating character classes and back references |
Repeating character classes and back references |
| 140 |
----------------------------------------------- |
----------------------------------------------- |
| 141 |
|
|
| 142 |
In both cases, the repeat information follows the base item. The matching code |
Single-character classes are handled specially (see above). This applies to |
| 143 |
looks at the following opcode to see if it is one of |
OP_CLASS and OP_REF. In both cases, the repeat information follows the base |
| 144 |
|
item. The matching code looks at the following opcode to see if it is one of |
| 145 |
|
|
| 146 |
OP_CRSTAR |
OP_CRSTAR |
| 147 |
OP_CRMINSTAR |
OP_CRMINSTAR |
| 193 |
Assertions |
Assertions |
| 194 |
---------- |
---------- |
| 195 |
|
|
| 196 |
Assertions are just like other subpatterns, but starting with one of the |
Forward assertions are just like other subpatterns, but starting with one of |
| 197 |
opcodes OP_ASSERT or OP_ASSERT_NOT. |
the opcodes OP_ASSERT or OP_ASSERT_NOT. Backward assertions use the opcodes |
| 198 |
|
OP_ASSERTBACK and OP_ASSERTBACK_NOT, and the first opcode inside the assertion |
| 199 |
|
is OP_REVERSE, followed by a two byte count of the number of characters to move |
| 200 |
|
back the pointer in the subject string. A separate count is present in each |
| 201 |
|
alternative of a lookbehind assertion, allowing them to have different fixed |
| 202 |
|
lengths. |
| 203 |
|
|
| 204 |
|
|
| 205 |
Once-only subpatterns |
Once-only subpatterns |
| 209 |
OP_ONCE. |
OP_ONCE. |
| 210 |
|
|
| 211 |
|
|
| 212 |
|
Conditional subpatterns |
| 213 |
|
----------------------- |
| 214 |
|
|
| 215 |
|
These are like other subpatterns, but they start with the opcode OP_COND. If |
| 216 |
|
the condition is a back reference, this is stored at the start of the |
| 217 |
|
subpattern using the opcode OP_CREF followed by one byte containing the |
| 218 |
|
reference number. Otherwise, a conditional subpattern will always start with |
| 219 |
|
one of the assertions. |
| 220 |
|
|
| 221 |
|
|
| 222 |
|
Changing options |
| 223 |
|
---------------- |
| 224 |
|
|
| 225 |
|
If any of the /i, /m, or /s options are changed within a parenthesized group, |
| 226 |
|
an OP_OPT opcode is compiled, followed by one byte containing the new settings |
| 227 |
|
of these flags. If there are several alternatives in a group, there is an |
| 228 |
|
occurrence of OP_OPT at the start of all those following the first options |
| 229 |
|
change, to set appropriate options for the start of the alternative. |
| 230 |
|
Immediately after the end of the group there is another such item to reset the |
| 231 |
|
flags to their previous values. Other changes of flag within the pattern can be |
| 232 |
|
handled entirely at compile time, and so do not cause anything to be put into |
| 233 |
|
the compiled data. |
| 234 |
|
|
| 235 |
|
|
| 236 |
Philip Hazel |
Philip Hazel |
| 237 |
December 1997 |
January 1999 |