| 48 |
|
|
| 49 |
OP_END end of pattern |
OP_END end of pattern |
| 50 |
OP_ANY match any character |
OP_ANY match any character |
| 51 |
|
OP_ANYBYTE match any single byte, even in UTF-8 mode |
| 52 |
OP_SOD match start of data: \A |
OP_SOD match start of data: \A |
| 53 |
|
OP_SOM start of match (subject + offset): \G |
| 54 |
OP_CIRC ^ (start of data, or after \n in multiline) |
OP_CIRC ^ (start of data, or after \n in multiline) |
| 55 |
OP_NOT_WORD_BOUNDARY \B |
OP_NOT_WORD_BOUNDARY \B |
| 56 |
OP_WORD_BOUNDARY \b |
OP_WORD_BOUNDARY \b |
| 63 |
OP_EODN match end of data or \n at end: \Z |
OP_EODN match end of data or \n at end: \Z |
| 64 |
OP_EOD match end of data: \z |
OP_EOD match end of data: \z |
| 65 |
OP_DOLL $ (end of data, or before \n in multiline) |
OP_DOLL $ (end of data, or before \n in multiline) |
|
OP_RECURSE match the pattern recursively |
|
| 66 |
|
|
| 67 |
|
|
| 68 |
Repeating single characters |
Repeating single characters |
| 120 |
Character classes |
Character classes |
| 121 |
----------------- |
----------------- |
| 122 |
|
|
| 123 |
When characters less than 256 are involved, OP_CLASS is used for a character |
If there is only one character, OP_CHARS is used for a positive class, |
|
class. If there is only one character, OP_CHARS is used for a positive class, |
|
| 124 |
and OP_NOT for a negative one (that is, for something like [^a]). However, in |
and OP_NOT for a negative one (that is, for something like [^a]). However, in |
| 125 |
UTF-8 mode, this applies only to characters with values < 128, because OP_NOT |
UTF-8 mode, this applies only to characters with values < 128, because OP_NOT |
| 126 |
is confined to single bytes. |
is confined to single bytes. |
| 129 |
negated, single-character class. The normal ones (OP_STAR etc.) are used for a |
negated, single-character class. The normal ones (OP_STAR etc.) are used for a |
| 130 |
repeated positive single-character class. |
repeated positive single-character class. |
| 131 |
|
|
| 132 |
OP_CLASS is followed by a 32-byte bit map containing a 1 bit for every |
When there's more than one character in a class and all the characters are less |
| 133 |
character that is acceptable. The bits are counted from the least significant |
than 256, OP_CLASS is used for a positive class, and OP_NCLASS for a negative |
| 134 |
end of each byte. |
one. In either case, the opcode is followed by a 32-byte bit map containing a 1 |
| 135 |
|
bit for every character that is acceptable. The bits are counted from the least |
| 136 |
|
significant end of each byte. |
| 137 |
|
|
| 138 |
|
The reason for having both OP_CLASS and OP_NCLASS is so that, in UTF-8 mode, |
| 139 |
|
subject characters with values greater than 255 can be handled correctly. For
| 140 |
|
OP_CLASS they don't match, whereas for OP_NCLASS they do. |
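
The bit map layout and the OP_CLASS/OP_NCLASS distinction described above can be sketched in C as follows. This is only an illustration under stated assumptions: the names CLASS_MAP_BYTES, class_map_set and class_map_match are hypothetical and are not taken from the PCRE source, and the negated_class flag simply stands in for "the opcode was OP_NCLASS".

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical names for illustration; not part of PCRE itself. */
enum { CLASS_MAP_BYTES = 32 };  /* 32 bytes * 8 bits = one bit per value 0..255 */

/* Mark character c (0..255) as acceptable in the class bit map.
   Bits are counted from the least significant end of each byte. */
static void class_map_set(uint8_t map[CLASS_MAP_BYTES], unsigned int c)
{
    assert(c < 256);
    map[c / 8] |= (uint8_t)(1u << (c % 8));
}

/* Test a subject character against the bit map.
   The map already holds a 1 bit for every acceptable character, for both
   positive and negative classes, so values below 256 are a plain lookup.
   Values above 255 have no bit: they do not match a positive class
   (OP_CLASS) but do match a negative one (OP_NCLASS). */
static int class_map_match(const uint8_t map[CLASS_MAP_BYTES],
                           unsigned int c, int negated_class)
{
    if (c > 255)
        return negated_class ? 1 : 0;
    return (map[c / 8] >> (c % 8)) & 1;
}
```

For example, after clearing a map and calling class_map_set(map, 'a'), the lookup class_map_match(map, 'a', 0) yields 1, while a subject character such as 0x100 yields 0 when negated_class is 0 and 1 when it is 1.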
| 141 |
|
|
| 142 |
For classes containing characters with values > 255, OP_XCLASS is used. It |
For classes containing characters with values > 255, OP_XCLASS is used. It |
| 143 |
optionally uses a bit map (if any characters lie within it), followed by a list |
optionally uses a bit map (if any characters lie within it), followed by a list |
| 249 |
conditional subpattern always starts with one of the assertions. |
conditional subpattern always starts with one of the assertions. |
| 250 |
|
|
| 251 |
|
|
| 252 |
|
Recursion |
| 253 |
|
--------- |
| 254 |
|
|
| 255 |
|
Recursion either matches the current regex, or some subexpression. The opcode |
| 256 |
|
OP_RECURSE is followed by a value which is the offset to the starting bracket
| 257 |
|
from the start of the whole pattern. |
| 258 |
|
|
| 259 |
|
|
| 260 |
|
Callout |
| 261 |
|
------- |
| 262 |
|
|
| 263 |
|
OP_CALLOUT is followed by one byte of data that holds a callout number in the |
| 264 |
|
range 0 to 255. |
| 265 |
|
|
| 266 |
|
|
| 267 |
Changing options |
Changing options |
| 268 |
---------------- |
---------------- |
| 269 |
|
|
| 278 |
data. |
data. |
| 279 |
|
|
| 280 |
Philip Hazel |
Philip Hazel |
| 281 |
August 2002 |
August 2003 |