| 68 |
I had a flash of inspiration as to how I could run the real compile function in |
I had a flash of inspiration as to how I could run the real compile function in |
| 69 |
a "fake" mode that enables it to compute how much memory it would need, while |
a "fake" mode that enables it to compute how much memory it would need, while |
| 70 |
actually only ever using a few hundred bytes of working memory, and without too |
actually only ever using a few hundred bytes of working memory, and without too |
| 71 |
many tests of the mode that might slow it down. So I re-factored the compiling |
many tests of the mode that might slow it down. So I refactored the compiling |
| 72 |
functions to work this way. This got rid of about 600 lines of source. It |
functions to work this way. This got rid of about 600 lines of source. It |
| 73 |
should make future maintenance and development easier. As this was such a major |
should make future maintenance and development easier. As this was such a major |
| 74 |
change, I never released 6.8, instead upping the number to 7.0 (other quite |
change, I never released 6.8, instead upping the number to 7.0 (other quite |
| 108 |
ever active at once. I believe some other regex matchers work this way. |
ever active at once. I believe some other regex matchers work this way. |
| 109 |
|
|
| 110 |
|
|
| 111 |
|
Changeable options |
| 112 |
|
------------------ |
| 113 |
|
|
| 114 |
|
The /i, /m, or /s options (PCRE_CASELESS, PCRE_MULTILINE, PCRE_DOTALL) may |
| 115 |
|
change in the middle of patterns. From PCRE 8.13, their processing is handled |
| 116 |
|
entirely at compile time by generating different opcodes for the different |
| 117 |
|
settings. The runtime functions do not need to keep track of an options state |
| 118 |
|
any more. |
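
As a rough illustration of the compile-time approach described above, the
sketch below picks an opcode for a pattern item from the options in force at
that point in the pattern. The option bits, the opcode numbers, the item enum
and the function opcode_for() are all invented for this example; they are not
PCRE's real values or internal functions.

```c
#include <assert.h>

/* Hypothetical option bits and opcode numbers, for illustration only. */
enum { OPT_CASELESS = 0x01, OPT_MULTILINE = 0x02 };
enum { OP_CHAR = 1, OP_CHARI, OP_CIRC, OP_CIRCM, OP_DOLL, OP_DOLLM };

/* What the pattern item is, independent of the current options. */
enum item { ITEM_LITERAL, ITEM_CIRCUMFLEX, ITEM_DOLLAR };

/* Choose the opcode for an item from the options in force where it
   appears in the pattern. The matcher then needs no options state. */
int opcode_for(enum item it, int options)
{
    switch (it) {
    case ITEM_LITERAL:
        return (options & OPT_CASELESS) ? OP_CHARI : OP_CHAR;
    case ITEM_CIRCUMFLEX:
        return (options & OPT_MULTILINE) ? OP_CIRCM : OP_CIRC;
    case ITEM_DOLLAR:
        return (options & OPT_MULTILINE) ? OP_DOLLM : OP_DOLL;
    }
    assert(!"unknown item");
    return -1;
}
```

Under these assumptions, a pattern such as a(?i)b would get OP_CHAR for the
first literal and OP_CHARI for the second, with nothing left for the runtime
code to track.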
| 119 |
|
|
| 120 |
|
|
| 121 |
Format of compiled patterns |
Format of compiled patterns |
| 122 |
--------------------------- |
--------------------------- |
| 123 |
|
|
| 134 |
"normal" compilation options. Data values that are counts (e.g. for |
"normal" compilation options. Data values that are counts (e.g. for |
| 135 |
quantifiers) are always just two bytes long. |
quantifiers) are always just two bytes long. |
| 136 |
|
|
|
A list of the opcodes follows: |
|
|
|
|
| 137 |
Opcodes with no following data |
Opcodes with no following data |
| 138 |
------------------------------ |
------------------------------ |
| 139 |
|
|
| 146 |
OP_SOD match start of data: \A |
OP_SOD match start of data: \A |
| 147 |
OP_SOM start of match (subject + offset): \G |
OP_SOM start of match (subject + offset): \G |
| 148 |
OP_SET_SOM set start of match (\K) |
OP_SET_SOM set start of match (\K) |
| 149 |
OP_CIRC ^ (start of data, or after \n in multiline) |
OP_CIRC ^ (start of data) |
| 150 |
|
OP_CIRCM ^ multiline mode (start of data or after newline) |
| 151 |
OP_NOT_WORD_BOUNDARY \B |
OP_NOT_WORD_BOUNDARY \B |
| 152 |
OP_WORD_BOUNDARY \b |
OP_WORD_BOUNDARY \b |
| 153 |
OP_NOT_DIGIT \D |
OP_NOT_DIGIT \D |
| 162 |
OP_WORDCHAR \w |
OP_WORDCHAR \w |
| 163 |
OP_EODN match end of data or \n at end: \Z |
OP_EODN match end of data or \n at end: \Z |
| 164 |
OP_EOD match end of data: \z |
OP_EOD match end of data: \z |
| 165 |
OP_DOLL $ (end of data, or before \n in multiline) |
OP_DOLL $ (end of data, or before final newline) |
| 166 |
|
OP_DOLLM $ multiline mode (end of data or before newline) |
| 167 |
OP_EXTUNI match an extended Unicode character |
OP_EXTUNI match an extended Unicode character |
| 168 |
OP_ANYNL match any Unicode newline sequence |
OP_ANYNL match any Unicode newline sequence |
| 169 |
|
|
| 187 |
offset value. |
offset value. |
| 188 |
|
|
| 189 |
|
|
| 190 |
|
Matching literal characters |
| 191 |
|
--------------------------- |
| 192 |
|
|
| 193 |
|
The OP_CHAR opcode is followed by a single character that is to be matched |
| 194 |
|
casefully. For caseless matching, OP_CHARI is used. In UTF-8 mode, the |
| 195 |
|
character may be more than one byte long. (Earlier versions of PCRE used |
| 196 |
|
multi-character strings, but this was changed to allow some new features to be |
| 197 |
|
added.) |
| 198 |
|
|
| 199 |
|
|
| 200 |
Repeating single characters |
Repeating single characters |
| 201 |
--------------------------- |
--------------------------- |
| 202 |
|
|
| 203 |
The common repeats (*, +, ?) when applied to a single character use the |
The common repeats (*, +, ?) when applied to a single character use the |
| 204 |
following opcodes: |
following opcodes, which come in caseful and caseless versions: |
| 205 |
|
|
| 206 |
OP_STAR |
Caseful Caseless |
| 207 |
OP_MINSTAR |
OP_STAR OP_STARI |
| 208 |
OP_POSSTAR |
OP_MINSTAR OP_MINSTARI |
| 209 |
OP_PLUS |
OP_POSSTAR OP_POSSTARI |
| 210 |
OP_MINPLUS |
OP_PLUS OP_PLUSI |
| 211 |
OP_POSPLUS |
OP_MINPLUS OP_MINPLUSI |
| 212 |
OP_QUERY |
OP_POSPLUS OP_POSPLUSI |
| 213 |
OP_MINQUERY |
OP_QUERY OP_QUERYI |
| 214 |
OP_POSQUERY |
OP_MINQUERY OP_MINQUERYI |
| 215 |
|
OP_POSQUERY OP_POSQUERYI |
| 216 |
|
|
| 217 |
In ASCII mode, these are two-byte items; in UTF-8 mode, the length is variable. |
In ASCII mode, these are two-byte items; in UTF-8 mode, the length is variable. |
| 218 |
Those with "MIN" in their name are the minimizing versions. Those with "POS" in |
Those with "MIN" in their name are the minimizing versions. Those with "POS" in |
| 219 |
their names are possessive versions. Each is followed by the character that is |
their names are possessive versions. Each is followed by the character that is |
| 220 |
to be repeated. Other repeats make use of |
to be repeated. Other repeats make use of these opcodes: |
| 221 |
|
|
| 222 |
OP_UPTO |
Caseful Caseless |
| 223 |
OP_MINUPTO |
OP_UPTO OP_UPTOI |
| 224 |
OP_POSUPTO |
OP_MINUPTO OP_MINUPTOI |
| 225 |
OP_EXACT |
OP_POSUPTO OP_POSUPTOI |
| 226 |
|
OP_EXACT OP_EXACTI |
| 227 |
|
|
| 228 |
which are followed by a two-byte count (most significant first) and the |
Each of these is followed by a two-byte count (most significant first) and the |
| 229 |
repeated character. OP_UPTO matches from 0 to the given number. A repeat with a |
repeated character. OP_UPTO matches from 0 to the given number. A repeat with a |
| 230 |
non-zero minimum and a fixed maximum is coded as an OP_EXACT followed by an |
non-zero minimum and a fixed maximum is coded as an OP_EXACT followed by an |
| 231 |
OP_UPTO (or OP_MINUPTO or OP_POSUPTO). |
OP_UPTO (or OP_MINUPTO or OP_POSUPTO). |
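
The encoding described above can be sketched roughly as follows, for a greedy,
caseful repeat such as a{2,5} on a single-byte character. The opcode numbers
and the names put2(), get2() and emit_repeat() are made up for this example and
are not PCRE's real values or functions. The sketch also assumes that the count
following OP_UPTO is the difference between the maximum and the minimum.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical opcode numbers, chosen only for this sketch. */
enum { OP_UPTO = 1, OP_EXACT = 2 };

/* Store a two-byte count, most significant byte first. */
unsigned char *put2(unsigned char *p, unsigned int n)
{
    assert(n <= 0xffff);
    *p++ = (unsigned char)(n >> 8);
    *p++ = (unsigned char)(n & 0xff);
    return p;
}

/* Read back a two-byte count stored by put2(). */
unsigned int get2(const unsigned char *p)
{
    return ((unsigned int)p[0] << 8) | p[1];
}

/* Emit code for c{min,max} with a fixed max (min <= max, max > 0).
   Returns the number of bytes written to buf (at most 8). */
size_t emit_repeat(unsigned char *buf, unsigned char c,
                   unsigned int min, unsigned int max)
{
    unsigned char *p = buf;
    assert(min <= max && max > 0);
    if (min > 0) {               /* mandatory part: OP_EXACT min c */
        *p++ = OP_EXACT;
        p = put2(p, min);
        *p++ = c;
    }
    if (max > min) {             /* optional part: OP_UPTO (max - min) c */
        *p++ = OP_UPTO;
        p = put2(p, max - min);
        *p++ = c;
    }
    return (size_t)(p - buf);
}
```

With these assumptions, a{2,5} becomes OP_EXACT 2 'a' followed by OP_UPTO 3
'a', eight bytes in all, while a{0,4} needs only the OP_UPTO item.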
| 266 |
value. |
value. |
| 267 |
|
|
| 268 |
|
|
|
Matching literal characters |
|
|
--------------------------- |
|
|
|
|
|
The OP_CHAR opcode is followed by a single character that is to be matched |
|
|
casefully. For caseless matching, OP_CHARNC is used. In UTF-8 mode, the |
|
|
character may be more than one byte long. (Earlier versions of PCRE used |
|
|
multi-character strings, but this was changed to allow some new features to be |
|
|
added.) |
|
|
|
|
|
|
|
| 269 |
Character classes |
Character classes |
| 270 |
----------------- |
----------------- |
| 271 |
|
|
| 272 |
If there is only one character, OP_CHAR or OP_CHARNC is used for a positive |
If there is only one character, OP_CHAR or OP_CHARI is used for a positive |
| 273 |
class, and OP_NOT for a negative one (that is, for something like [^a]). |
class, and OP_NOT or OP_NOTI for a negative one (that is, for something like |
| 274 |
However, in UTF-8 mode, the use of OP_NOT applies only to characters with |
[^a]). However, in UTF-8 mode, the use of OP_NOT[I] applies only to characters |
| 275 |
values < 128, because OP_NOT is confined to single bytes. |
with values < 128, because OP_NOT[I] is confined to single bytes. |
| 276 |
|
|
| 277 |
Another set of repeating opcodes (OP_NOTSTAR etc.) are used for a repeated, |
Another set of 13 repeating opcodes (called OP_NOTSTAR etc.) are used for a |
| 278 |
negated, single-character class. The normal ones (OP_STAR etc.) are used for a |
repeated, negated, single-character class. The normal single-character opcodes |
| 279 |
repeated positive single-character class. |
(OP_STAR, etc.) are used for a repeated positive single-character class. |
| 280 |
|
|
| 281 |
When there's more than one character in a class and all the characters are less |
When there is more than one character in a class and all the characters are |
| 282 |
than 256, OP_CLASS is used for a positive class, and OP_NCLASS for a negative |
less than 256, OP_CLASS is used for a positive class, and OP_NCLASS for a |
| 283 |
one. In either case, the opcode is followed by a 32-byte bit map containing a 1 |
negative one. In either case, the opcode is followed by a 32-byte bit map |
| 284 |
bit for every character that is acceptable. The bits are counted from the least |
containing a 1 bit for every character that is acceptable. The bits are counted |
| 285 |
significant end of each byte. |
from the least significant end of each byte. In caseless mode, bits for both |
| 286 |
|
cases are set. |
| 287 |
|
|
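
To make the bit map layout concrete, here is a minimal sketch of setting and
testing bits in a 32-byte map, counting bits from the least significant end of
each byte. The type class_map and the functions class_set(), class_test() and
build_abc_caseless() are hypothetical names for this illustration, not part of
PCRE, and the caseless example assumes an ASCII character set.

```c
#include <assert.h>
#include <string.h>

/* 32 bytes * 8 bits = one bit for each character value 0..255. */
typedef struct { unsigned char bits[32]; } class_map;

/* Mark character c (0..255) as acceptable. */
void class_set(class_map *m, unsigned int c)
{
    assert(c < 256);
    m->bits[c / 8] |= (unsigned char)(1u << (c % 8));
}

/* Return nonzero if character c (0..255) is acceptable. */
int class_test(const class_map *m, unsigned int c)
{
    assert(c < 256);
    return (m->bits[c / 8] >> (c % 8)) & 1;
}

/* Build the map for a caseless [a-c]: bits for both cases are set. */
void build_abc_caseless(class_map *m)
{
    unsigned int c;
    memset(m->bits, 0, sizeof m->bits);
    for (c = 'a'; c <= 'c'; c++) {
        class_set(m, c);
        class_set(m, c - 'a' + 'A');
    }
}
```

A matcher working this way can test a subject character with one index and
one shift, whatever the size of the class.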
| 288 |
The reason for having both OP_CLASS and OP_NCLASS is so that, in UTF-8 mode, |
The reason for having both OP_CLASS and OP_NCLASS is so that, in UTF-8 mode, |
| 289 |
subject characters with values greater than 255 can be handled correctly. For |
subject characters with values greater than 255 can be handled correctly. For |
| 290 |
OP_CLASS they don't match, whereas for OP_NCLASS they do. |
OP_CLASS they do not match, whereas for OP_NCLASS they do. |
| 291 |
|
|
| 292 |
For classes containing characters with values > 255, OP_XCLASS is used. It |
For classes containing characters with values > 255, OP_XCLASS is used. It |
| 293 |
optionally uses a bit map (if any characters lie within it), followed by a list |
optionally uses a bit map (if any characters lie within it), followed by a list |
| 294 |
of pairs and single characters. There is a flag character that indicates |
of pairs (for a range) and single characters. In caseless mode, both cases are |
| 295 |
whether it's a positive or a negative class. |
explicitly listed. There is a flag character that indicates whether it is a |
| 296 |
|
positive or a negative class. |
| 297 |
|
|
| 298 |
|
|
| 299 |
Back references |
Back references |
| 300 |
--------------- |
--------------- |
| 301 |
|
|
| 302 |
OP_REF is followed by two bytes containing the reference number. |
OP_REF (caseful) or OP_REFI (caseless) is followed by two bytes containing the |
| 303 |
|
reference number. |
| 304 |
|
|
| 305 |
|
|
| 306 |
Repeating character classes and back references |
Repeating character classes and back references |
| 307 |
----------------------------------------------- |
----------------------------------------------- |
| 308 |
|
|
| 309 |
Single-character classes are handled specially (see above). This section |
Single-character classes are handled specially (see above). This section |
| 310 |
applies to OP_CLASS and OP_REF. In both cases, the repeat information follows |
applies to OP_CLASS and OP_REF[I]. In both cases, the repeat information |
| 311 |
the base item. The matching code looks at the following opcode to see if it is |
follows the base item. The matching code looks at the following opcode to see |
| 312 |
one of |
if it is one of |
| 313 |
|
|
| 314 |
OP_CRSTAR |
OP_CRSTAR |
| 315 |
OP_CRMINSTAR |
OP_CRMINSTAR |
| 438 |
next item. |
next item. |
| 439 |
|
|
| 440 |
|
|
|
Changing options |
|
|
---------------- |
|
|
|
|
|
If any of the /i, /m, or /s options are changed within a pattern, an OP_OPT |
|
|
opcode is compiled, followed by one byte containing the new settings of these |
|
|
flags. If there are several alternatives, there is an occurrence of OP_OPT at |
|
|
the start of all those following the first options change, to set appropriate |
|
|
options for the start of the alternative. Immediately after the end of the |
|
|
group there is another such item to reset the flags to their previous values. A |
|
|
change of flag right at the very start of the pattern can be handled entirely |
|
|
at compile time, and so does not cause anything to be put into the compiled |
|
|
data. |
|
|
|
|
| 441 |
Philip Hazel |
Philip Hazel |
| 442 |
October 2010 |
May 2011 |