| 23 |
individual wild portions of the pattern. This is an "NFA algorithm" in Friedl's |
individual wild portions of the pattern. This is an "NFA algorithm" in Friedl's |
| 24 |
terminology. |
terminology. |
| 25 |
|
|
| 26 |
For this set of functions that forms PCRE, I tried at first to invent an |
For the set of functions that forms PCRE (which are unrelated to those |
| 27 |
algorithm that used an amount of store bounded by a multiple of the number of |
mentioned above), I tried at first to invent an algorithm that used an amount |
| 28 |
characters in the pattern, to save on compiling time. However, because of the |
of store bounded by a multiple of the number of characters in the pattern, to |
| 29 |
greater complexity in Perl regular expressions, I couldn't do this. In any |
save on compiling time. However, because of the greater complexity in Perl |
| 30 |
case, a first pass through the pattern is needed, in order to find internal |
regular expressions, I couldn't do this. In any case, a first pass through the |
| 31 |
flag settings like (?i) at top level. So it works by running a very degenerate |
pattern is needed, in order to find internal flag settings like (?i) at top |
| 32 |
first pass to calculate a maximum store size, and then a second pass to do the |
level. So PCRE works by running a very degenerate first pass to calculate a |
| 33 |
real compile - which may use a bit less than the predicted amount of store. The |
maximum store size, and then a second pass to do the real compile - which may |
| 34 |
idea is that this is going to turn out faster because the first pass is |
use a bit less than the predicted amount of store. The idea is that this is |
| 35 |
degenerate and the second can just store stuff straight into the vector. It |
going to turn out faster because the first pass is degenerate and the second |
| 36 |
does make the compiling functions bigger, of course, but they have got quite |
pass can just store stuff straight into the vector. It does make the compiling |
| 37 |
big anyway to handle all the Perl stuff. |
functions bigger, of course, but they have got quite big anyway to handle all |
| 38 |
|
the Perl stuff. |
| 39 |
|
|
| 40 |
The compiled form of a pattern is a vector of bytes, containing items of |
The compiled form of a pattern is a vector of bytes, containing items of |
| 41 |
variable length. The first byte in an item is an opcode, and the length of the |
variable length. The first byte in an item is an opcode, and the length of the |
| 62 |
OP_EODN match end of data or \n at end: \Z |
OP_EODN match end of data or \n at end: \Z |
| 63 |
OP_EOD match end of data: \z |
OP_EOD match end of data: \z |
| 64 |
OP_DOLL $ (end of data, or before \n in multiline) |
OP_DOLL $ (end of data, or before \n in multiline) |
| 65 |
|
OP_RECURSE match the pattern recursively |
| 66 |
|
|
| 67 |
|
|
| 68 |
Repeating single characters |
Repeating single characters |
| 127 |
repeated, negated, single-character class. The normal ones (OP_STAR etc.) are |
repeated, negated, single-character class. The normal ones (OP_STAR etc.) are |
| 128 |
used for a repeated positive single-character class. |
used for a repeated positive single-character class. |
| 129 |
|
|
| 130 |
OP_CLASS is followed by a 32-byte bit map containing a 1 |
OP_CLASS is followed by a 32-byte bit map containing a 1 bit for every |
| 131 |
bit for every character that is acceptable. The bits are counted from the least |
character that is acceptable. The bits are counted from the least significant |
| 132 |
significant end of each byte. |
end of each byte. |
| 133 |
|
|
| 134 |
|
|
| 135 |
Back references |
Back references |
| 136 |
--------------- |
--------------- |
| 137 |
|
|
| 138 |
OP_REF is followed by a single byte containing the reference number. |
OP_REF is followed by two bytes containing the reference number. |
| 139 |
|
|
| 140 |
|
|
| 141 |
Repeating character classes and back references |
Repeating character classes and back references |
| 161 |
Brackets and alternation |
Brackets and alternation |
| 162 |
------------------------ |
------------------------ |
| 163 |
|
|
| 164 |
A pair of non-identifying (round) brackets is wrapped round each expression at |
A pair of non-capturing (round) brackets is wrapped round each expression at |
| 165 |
compile time, so alternation always happens in the context of brackets. |
compile time, so alternation always happens in the context of brackets. |
| 166 |
Non-identifying brackets use the opcode OP_BRA, while identifying brackets use |
|
| 167 |
|
Non-capturing brackets use the opcode OP_BRA, while capturing brackets use |
| 168 |
OP_BRA+1, OP_BRA+2, etc. [Note for North Americans: "bracket" to some English |
OP_BRA+1, OP_BRA+2, etc. [Note for North Americans: "bracket" to some English |
| 169 |
speakers, including myself, can be round, square, or curly. Hence this usage.] |
speakers, including myself, can be round, square, curly, or pointy. Hence this |
| 170 |
|
usage.] |
| 171 |
|
|
| 172 |
|
Originally PCRE was limited to 99 capturing brackets (so as not to use up all |
| 173 |
|
the opcodes). From release 3.5, there is no limit. What happens is that the |
| 174 |
|
first ones, up to EXTRACT_BASIC_MAX are handled with separate opcodes, as |
| 175 |
|
above. If there are more, the opcode is set to EXTRACT_BASIC_MAX+1, and the |
| 176 |
|
first operation in the bracket is OP_BRANUMBER, followed by a 2-byte bracket |
| 177 |
|
number. This opcode is ignored while matching, but is fished out when handling |
| 178 |
|
the bracket itself. (They could have all been done like this, but I was making |
| 179 |
|
minimal changes.) |
| 180 |
|
|
| 181 |
A bracket opcode is followed by two bytes which give the offset to the next |
A bracket opcode is followed by two bytes which give the offset to the next |
| 182 |
alternative OP_ALT or, if there aren't any branches, to the matching KET |
alternative OP_ALT or, if there aren't any branches, to the matching KET |
| 201 |
A subpattern with a bounded maximum repetition is replicated in a nested |
A subpattern with a bounded maximum repetition is replicated in a nested |
| 202 |
fashion up to the maximum number of times, with BRAZERO or BRAMINZERO before |
fashion up to the maximum number of times, with BRAZERO or BRAMINZERO before |
| 203 |
each replication after the minimum, so that, for example, (abc){2,5} is |
each replication after the minimum, so that, for example, (abc){2,5} is |
| 204 |
compiled as (abc)(abc)((abc)((abc)(abc)?)?)?. The 200-bracket limit does not |
compiled as (abc)(abc)((abc)((abc)(abc)?)?)?. The 99 and 200 bracket limits do |
| 205 |
apply to these internally generated brackets. |
not apply to these internally generated brackets. |
| 206 |
|
|
| 207 |
|
|
| 208 |
Assertions |
Assertions |
| 212 |
the opcodes OP_ASSERT or OP_ASSERT_NOT. Backward assertions use the opcodes |
the opcodes OP_ASSERT or OP_ASSERT_NOT. Backward assertions use the opcodes |
| 213 |
OP_ASSERTBACK and OP_ASSERTBACK_NOT, and the first opcode inside the assertion |
OP_ASSERTBACK and OP_ASSERTBACK_NOT, and the first opcode inside the assertion |
| 214 |
is OP_REVERSE, followed by a two byte count of the number of characters to move |
is OP_REVERSE, followed by a two byte count of the number of characters to move |
| 215 |
back the pointer in the subject string. A separate count is present in each |
back the pointer in the subject string. When operating in UTF-8 mode, the count |
| 216 |
alternative of a lookbehind assertion, allowing them to have different fixed |
is a character count rather than a byte count. A separate count is present in |
| 217 |
lengths. |
each alternative of a lookbehind assertion, allowing them to have different |
| 218 |
|
fixed lengths. |
| 219 |
|
|
| 220 |
|
|
| 221 |
Once-only subpatterns |
Once-only subpatterns |
| 230 |
|
|
| 231 |
These are like other subpatterns, but they start with the opcode OP_COND. If |
These are like other subpatterns, but they start with the opcode OP_COND. If |
| 232 |
the condition is a back reference, this is stored at the start of the |
the condition is a back reference, this is stored at the start of the |
| 233 |
subpattern using the opcode OP_CREF followed by one byte containing the |
subpattern using the opcode OP_CREF followed by two bytes containing the |
| 234 |
reference number. Otherwise, a conditional subpattern will always start with |
reference number. Otherwise, a conditional subpattern will always start with |
| 235 |
one of the assertions. |
one of the assertions. |
| 236 |
|
|
| 250 |
|
|
| 251 |
|
|
| 252 |
Philip Hazel |
Philip Hazel |
| 253 |
January 1999 |
August 2001 |