| 2 |
-------------------------- |
-------------------------- |
| 3 |
|
|
| 4 |
These are very rough technical notes that record potentially useful information |
These are very rough technical notes that record potentially useful information |
| 5 |
about PCRE internals. |
about PCRE internals. For information about testing PCRE, see the pcretest |
| 6 |
|
documentation and the comment at the head of the RunTest file. |
| 7 |
|
|
| 8 |
|
|
| 9 |
Historical note 1 |
Historical note 1 |
| 10 |
----------------- |
----------------- |
| 24 |
not necessarily maximize the individual wild portions of the pattern, as is |
not necessarily maximize the individual wild portions of the pattern, as is |
| 25 |
expected in Unix and Perl-style regular expressions. |
expected in Unix and Perl-style regular expressions. |
| 26 |
|
|
| 27 |
|
|
| 28 |
Historical note 2 |
Historical note 2 |
| 29 |
----------------- |
----------------- |
| 30 |
|
|
| 37 |
matches individual wild portions of the pattern. This is an "NFA algorithm" in |
matches individual wild portions of the pattern. This is an "NFA algorithm" in |
| 38 |
Friedl's terminology. |
Friedl's terminology. |
| 39 |
|
|
| 40 |
|
|
| 41 |
OK, here's the real stuff |
OK, here's the real stuff |
| 42 |
------------------------- |
------------------------- |
| 43 |
|
|
| 48 |
complexity in Perl regular expressions, I couldn't do this. In any case, a |
complexity in Perl regular expressions, I couldn't do this. In any case, a |
| 49 |
first pass through the pattern is helpful for other reasons. |
first pass through the pattern is helpful for other reasons. |
| 50 |
|
|
| 51 |
|
|
| 52 |
Computing the memory requirement: how it was |
Computing the memory requirement: how it was |
| 53 |
-------------------------------------------- |
-------------------------------------------- |
| 54 |
|
|
| 59 |
the first pass is degenerate and the second pass can just store stuff straight |
the first pass is degenerate and the second pass can just store stuff straight |
| 60 |
into the vector, which it knows is big enough. |
into the vector, which it knows is big enough. |
| 61 |
|
|
| 62 |
|
|
| 63 |
Computing the memory requirement: how it is |
Computing the memory requirement: how it is |
| 64 |
------------------------------------------- |
------------------------------------------- |
| 65 |
|
|
| 69 |
I had a flash of inspiration as to how I could run the real compile function in |
I had a flash of inspiration as to how I could run the real compile function in |
| 70 |
a "fake" mode that enables it to compute how much memory it would need, while |
a "fake" mode that enables it to compute how much memory it would need, while |
| 71 |
actually only ever using a few hundred bytes of working memory, and without too |
actually only ever using a few hundred bytes of working memory, and without too |
| 72 |
many tests of the mode that might slow it down. So I re-factored the compiling |
many tests of the mode that might slow it down. So I refactored the compiling |
| 73 |
functions to work this way. This got rid of about 600 lines of source. It |
functions to work this way. This got rid of about 600 lines of source. It |
| 74 |
should make future maintenance and development easier. As this was such a major |
should make future maintenance and development easier. As this was such a major |
| 75 |
change, I never released 6.8, instead upping the number to 7.0 (other quite |
change, I never released 6.8, instead upping the number to 7.0 (other quite |
| 81 |
is doing a full analysis of the pattern. My hope was that this would not be a |
is doing a full analysis of the pattern. My hope was that this would not be a |
| 82 |
big issue, and in the event, nobody has commented on it. |
big issue, and in the event, nobody has commented on it. |
| 83 |
|
|
| 84 |
|
|
| 85 |
Traditional matching function |
Traditional matching function |
| 86 |
----------------------------- |
----------------------------- |
| 87 |
|
|
| 89 |
it implements an NFA algorithm, similar to the original Henry Spencer algorithm |
it implements an NFA algorithm, similar to the original Henry Spencer algorithm |
| 90 |
and the way that Perl works. This is not surprising, since it is intended to be |
and the way that Perl works. This is not surprising, since it is intended to be |
| 91 |
as compatible with Perl as possible. This is the function most users of PCRE |
as compatible with Perl as possible. This is the function most users of PCRE |
| 92 |
will use most of the time. |
will use most of the time. From release 8.20, if PCRE is compiled with |
| 93 |
|
just-in-time (JIT) support, and studying a compiled pattern with JIT is |
| 94 |
|
successful, the JIT code is run instead of the normal pcre_exec() code, but the |
| 95 |
|
result is the same. |
| 96 |
|
|
| 97 |
|
|
| 98 |
Supplementary matching function |
Supplementary matching function |
| 99 |
------------------------------- |
------------------------------- |
| 112 |
ever active at once. I believe some other regex matchers work this way. |
ever active at once. I believe some other regex matchers work this way. |
| 113 |
|
|
| 114 |
|
|
| 115 |
|
Changeable options |
| 116 |
|
------------------ |
| 117 |
|
|
| 118 |
|
The /i, /m, or /s options (PCRE_CASELESS, PCRE_MULTILINE, PCRE_DOTALL) may |
| 119 |
|
change in the middle of patterns. From PCRE 8.13, their processing is handled |
| 120 |
|
entirely at compile time by generating different opcodes for the different |
| 121 |
|
settings. The runtime functions do not need to keep track of an options state |
| 122 |
|
any more. |
| 123 |
|
|
| 124 |
|
|
| 125 |
Format of compiled patterns |
Format of compiled patterns |
| 126 |
--------------------------- |
--------------------------- |
| 127 |
|
|
| 138 |
"normal" compilation options. Data values that are counts (e.g. for |
"normal" compilation options. Data values that are counts (e.g. for |
| 139 |
quantifiers) are always just two bytes long. |
quantifiers) are always just two bytes long. |
| 140 |
|
|
|
A list of the opcodes follows: |
|
|
|
|
|
|
|
| 141 |
Opcodes with no following data |
Opcodes with no following data |
| 142 |
------------------------------ |
------------------------------ |
| 143 |
|
|
| 150 |
OP_SOD match start of data: \A |
OP_SOD match start of data: \A |
| 151 |
OP_SOM, start of match (subject + offset): \G |
OP_SOM, start of match (subject + offset): \G |
| 152 |
OP_SET_SOM, set start of match (\K) |
OP_SET_SOM, set start of match (\K) |
| 153 |
OP_CIRC ^ (start of data, or after \n in multiline) |
OP_CIRC ^ (start of data) |
| 154 |
|
OP_CIRCM ^ multiline mode (start of data or after newline) |
| 155 |
OP_NOT_WORD_BOUNDARY \W |
OP_NOT_WORD_BOUNDARY \W |
| 156 |
OP_WORD_BOUNDARY \w |
OP_WORD_BOUNDARY \w |
| 157 |
OP_NOT_DIGIT \D |
OP_NOT_DIGIT \D |
| 166 |
OP_WORDCHAR \w |
OP_WORDCHAR \w |
| 167 |
OP_EODN match end of data or \n at end: \Z |
OP_EODN match end of data or \n at end: \Z |
| 168 |
OP_EOD match end of data: \z |
OP_EOD match end of data: \z |
| 169 |
OP_DOLL $ (end of data, or before \n in multiline) |
OP_DOLL $ (end of data, or before final newline) |
| 170 |
|
OP_DOLLM $ multiline mode (end of data or before newline) |
| 171 |
OP_EXTUNI match an extended Unicode character |
OP_EXTUNI match an extended Unicode character |
| 172 |
OP_ANYNL match any Unicode newline sequence |
OP_ANYNL match any Unicode newline sequence |
| 173 |
|
|
| 174 |
OP_ACCEPT ) These are Perl 5.10's "backtracking |
OP_ACCEPT ) These are Perl 5.10's "backtracking control |
| 175 |
OP_COMMIT ) control verbs". If OP_ACCEPT is inside |
OP_COMMIT ) verbs". If OP_ACCEPT is inside capturing |
| 176 |
OP_FAIL ) capturing parentheses, it may be preceded |
OP_FAIL ) parentheses, it may be preceded by one or more |
| 177 |
OP_PRUNE ) by one or more OP_CLOSE, followed by a 2-byte |
OP_PRUNE ) OP_CLOSE, followed by a 2-byte number, |
| 178 |
OP_SKIP ) number, indicating which parentheses must be |
OP_SKIP ) indicating which parentheses must be closed. |
| 179 |
OP_THEN ) closed. |
|
| 180 |
|
|
| 181 |
|
Backtracking control verbs with data |
| 182 |
|
------------------------------------ |
| 183 |
|
|
| 184 |
|
OP_THEN is followed by a LINK_SIZE offset, which is the distance back to the |
| 185 |
|
start of the current branch. |
| 186 |
|
|
| 187 |
|
OP_MARK is followed by the mark name, preceded by a one-byte length, and |
| 188 |
|
followed by a binary zero. For (*PRUNE), (*SKIP), and (*THEN) with arguments, |
| 189 |
|
the opcodes OP_PRUNE_ARG, OP_SKIP_ARG, and OP_THEN_ARG are used. For the first |
| 190 |
|
two, the name follows immediately; for OP_THEN_ARG, it follows the LINK_SIZE |
| 191 |
|
offset value. |
| 192 |
|
|
| 193 |
|
|
| 194 |
|
Matching literal characters |
| 195 |
|
--------------------------- |
| 196 |
|
|
| 197 |
|
The OP_CHAR opcode is followed by a single character that is to be matched |
| 198 |
|
casefully. For caseless matching, OP_CHARI is used. In UTF-8 mode, the |
| 199 |
|
character may be more than one byte long. (Earlier versions of PCRE used |
| 200 |
|
multi-character strings, but this was changed to allow some new features to be |
| 201 |
|
added.) |
| 202 |
|
|
| 203 |
|
|
| 204 |
Repeating single characters |
Repeating single characters |
| 205 |
--------------------------- |
--------------------------- |
| 206 |
|
|
| 207 |
The common repeats (*, +, ?) when applied to a single character use the |
The common repeats (*, +, ?) when applied to a single character use the |
| 208 |
following opcodes: |
following opcodes, which come in caseful and caseless versions: |
| 209 |
|
|
| 210 |
OP_STAR |
Caseful Caseless |
| 211 |
OP_MINSTAR |
OP_STAR OP_STARI |
| 212 |
OP_POSSTAR |
OP_MINSTAR OP_MINSTARI |
| 213 |
OP_PLUS |
OP_POSSTAR OP_POSSTARI |
| 214 |
OP_MINPLUS |
OP_PLUS OP_PLUSI |
| 215 |
OP_POSPLUS |
OP_MINPLUS OP_MINPLUSI |
| 216 |
OP_QUERY |
OP_POSPLUS OP_POSPLUSI |
| 217 |
OP_MINQUERY |
OP_QUERY OP_QUERYI |
| 218 |
OP_POSQUERY |
OP_MINQUERY OP_MINQUERYI |
| 219 |
|
OP_POSQUERY OP_POSQUERYI |
| 220 |
|
|
| 221 |
In ASCII mode, these are two-byte items; in UTF-8 mode, the length is variable. |
In ASCII mode, these are two-byte items; in UTF-8 mode, the length is variable. |
| 222 |
Those with "MIN" in their name are the minimizing versions. Those with "POS" in |
Those with "MIN" in their name are the minimizing versions. Those with "POS" in |
| 223 |
their names are possessive versions. Each is followed by the character that is |
their names are possessive versions. Each is followed by the character that is |
| 224 |
to be repeated. Other repeats make use of |
to be repeated. Other repeats make use of these opcodes: |
| 225 |
|
|
| 226 |
OP_UPTO |
Caseful Caseless |
| 227 |
OP_MINUPTO |
OP_UPTO OP_UPTOI |
| 228 |
OP_POSUPTO |
OP_MINUPTO OP_MINUPTOI |
| 229 |
OP_EXACT |
OP_POSUPTO OP_POSUPTOI |
| 230 |
|
OP_EXACT OP_EXACTI |
| 231 |
|
|
| 232 |
which are followed by a two-byte count (most significant first) and the |
Each of these is followed by a two-byte count (most significant first) and the |
| 233 |
repeated character. OP_UPTO matches from 0 to the given number. A repeat with a |
repeated character. OP_UPTO matches from 0 to the given number. A repeat with a |
| 234 |
non-zero minimum and a fixed maximum is coded as an OP_EXACT followed by an |
non-zero minimum and a fixed maximum is coded as an OP_EXACT followed by an |
| 235 |
OP_UPTO (or OP_MINUPTO or OPT_POSUPTO). |
OP_UPTO (or OP_MINUPTO or OPT_POSUPTO). |
| 270 |
value. |
value. |
| 271 |
|
|
| 272 |
|
|
|
Matching literal characters |
|
|
--------------------------- |
|
|
|
|
|
The OP_CHAR opcode is followed by a single character that is to be matched |
|
|
casefully. For caseless matching, OP_CHARNC is used. In UTF-8 mode, the |
|
|
character may be more than one byte long. (Earlier versions of PCRE used |
|
|
multi-character strings, but this was changed to allow some new features to be |
|
|
added.) |
|
|
|
|
|
|
|
| 273 |
Character classes |
Character classes |
| 274 |
----------------- |
----------------- |
| 275 |
|
|
| 276 |
If there is only one character, OP_CHAR or OP_CHARNC is used for a positive |
If there is only one character, OP_CHAR or OP_CHARI is used for a positive |
| 277 |
class, and OP_NOT for a negative one (that is, for something like [^a]). |
class, and OP_NOT or OP_NOTI for a negative one (that is, for something like |
| 278 |
However, in UTF-8 mode, the use of OP_NOT applies only to characters with |
[^a]). However, in UTF-8 mode, the use of OP_NOT[I] applies only to characters |
| 279 |
values < 128, because OP_NOT is confined to single bytes. |
with values < 128, because OP_NOT[I] is confined to single bytes. |
| 280 |
|
|
| 281 |
Another set of repeating opcodes (OP_NOTSTAR etc.) are used for a repeated, |
Another set of 13 repeating opcodes (called OP_NOTSTAR etc.) are used for a |
| 282 |
negated, single-character class. The normal ones (OP_STAR etc.) are used for a |
repeated, negated, single-character class. The normal single-character opcodes |
| 283 |
repeated positive single-character class. |
(OP_STAR, etc.) are used for a repeated positive single-character class. |
| 284 |
|
|
| 285 |
When there's more than one character in a class and all the characters are less |
When there is more than one character in a class and all the characters are |
| 286 |
than 256, OP_CLASS is used for a positive class, and OP_NCLASS for a negative |
less than 256, OP_CLASS is used for a positive class, and OP_NCLASS for a |
| 287 |
one. In either case, the opcode is followed by a 32-byte bit map containing a 1 |
negative one. In either case, the opcode is followed by a 32-byte bit map |
| 288 |
bit for every character that is acceptable. The bits are counted from the least |
containing a 1 bit for every character that is acceptable. The bits are counted |
| 289 |
significant end of each byte. |
from the least significant end of each byte. In caseless mode, bits for both |
| 290 |
|
cases are set. |
| 291 |
|
|
| 292 |
The reason for having both OP_CLASS and OP_NCLASS is so that, in UTF-8 mode, |
The reason for having both OP_CLASS and OP_NCLASS is so that, in UTF-8 mode, |
| 293 |
subject characters with values greater than 256 can be handled correctly. For |
subject characters with values greater than 256 can be handled correctly. For |
| 294 |
OP_CLASS they don't match, whereas for OP_NCLASS they do. |
OP_CLASS they do not match, whereas for OP_NCLASS they do. |
| 295 |
|
|
| 296 |
For classes containing characters with values > 255, OP_XCLASS is used. It |
For classes containing characters with values > 255, OP_XCLASS is used. It |
| 297 |
optionally uses a bit map (if any characters lie within it), followed by a list |
optionally uses a bit map (if any characters lie within it), followed by a list |
| 298 |
of pairs and single characters. There is a flag character than indicates |
of pairs (for a range) and single characters. In caseless mode, both cases are |
| 299 |
whether it's a positive or a negative class. |
explicitly listed. There is a flag character than indicates whether it is a |
| 300 |
|
positive or a negative class. |
| 301 |
|
|
| 302 |
|
|
| 303 |
Back references |
Back references |
| 304 |
--------------- |
--------------- |
| 305 |
|
|
| 306 |
OP_REF is followed by two bytes containing the reference number. |
OP_REF (caseful) or OP_REFI (caseless) is followed by two bytes containing the |
| 307 |
|
reference number. |
| 308 |
|
|
| 309 |
|
|
| 310 |
Repeating character classes and back references |
Repeating character classes and back references |
| 311 |
----------------------------------------------- |
----------------------------------------------- |
| 312 |
|
|
| 313 |
Single-character classes are handled specially (see above). This section |
Single-character classes are handled specially (see above). This section |
| 314 |
applies to OP_CLASS and OP_REF. In both cases, the repeat information follows |
applies to OP_CLASS and OP_REF[I]. In both cases, the repeat information |
| 315 |
the base item. The matching code looks at the following opcode to see if it is |
follows the base item. The matching code looks at the following opcode to see |
| 316 |
one of |
if it is one of |
| 317 |
|
|
| 318 |
OP_CRSTAR |
OP_CRSTAR |
| 319 |
OP_CRMINSTAR |
OP_CRMINSTAR |
| 353 |
|
|
| 354 |
OP_KET is used for subpatterns that do not repeat indefinitely, while |
OP_KET is used for subpatterns that do not repeat indefinitely, while |
| 355 |
OP_KETRMIN and OP_KETRMAX are used for indefinite repetitions, minimally or |
OP_KETRMIN and OP_KETRMAX are used for indefinite repetitions, minimally or |
| 356 |
maximally respectively. All three are followed by LINK_SIZE bytes giving (as a |
maximally respectively (see below for possessive repetitions). All three are |
| 357 |
positive number) the offset back to the matching bracket opcode. |
followed by LINK_SIZE bytes giving (as a positive number) the offset back to |
| 358 |
|
the matching bracket opcode. |
| 359 |
|
|
| 360 |
If a subpattern is quantified such that it is permitted to match zero times, it |
If a subpattern is quantified such that it is permitted to match zero times, it |
| 361 |
is preceded by one of OP_BRAZERO, OP_BRAMINZERO, or OP_SKIPZERO. These are |
is preceded by one of OP_BRAZERO, OP_BRAMINZERO, or OP_SKIPZERO. These are |
| 382 |
that it needs to check for matching an empty string when it hits OP_KETRMIN or |
that it needs to check for matching an empty string when it hits OP_KETRMIN or |
| 383 |
OP_KETRMAX, and if so, to break the loop. |
OP_KETRMAX, and if so, to break the loop. |
| 384 |
|
|
| 385 |
|
Possessive brackets |
| 386 |
|
------------------- |
| 387 |
|
|
| 388 |
|
When a repeated group (capturing or non-capturing) is marked as possessive by |
| 389 |
|
the "+" notation, e.g. (abc)++, different opcodes are used. Their names all |
| 390 |
|
have POS on the end, e.g. OP_BRAPOS instead of OP_BRA and OP_SCPBRPOS instead |
| 391 |
|
of OP_SCBRA. The end of such a group is marked by OP_KETRPOS. If the minimum |
| 392 |
|
repetition is zero, the group is preceded by OP_BRAPOSZERO. |
| 393 |
|
|
| 394 |
|
|
| 395 |
Assertions |
Assertions |
| 396 |
---------- |
---------- |
| 452 |
next item. |
next item. |
| 453 |
|
|
| 454 |
|
|
|
Changing options |
|
|
---------------- |
|
|
|
|
|
If any of the /i, /m, or /s options are changed within a pattern, an OP_OPT |
|
|
opcode is compiled, followed by one byte containing the new settings of these |
|
|
flags. If there are several alternatives, there is an occurrence of OP_OPT at |
|
|
the start of all those following the first options change, to set appropriate |
|
|
options for the start of the alternative. Immediately after the end of the |
|
|
group there is another such item to reset the flags to their previous values. A |
|
|
change of flag right at the very start of the pattern can be handled entirely |
|
|
at compile time, and so does not cause anything to be put into the compiled |
|
|
data. |
|
|
|
|
| 455 |
Philip Hazel |
Philip Hazel |
| 456 |
October 2009 |
August 2011 |