| 94 |
pcrestack discussion of stack usage |
pcrestack discussion of stack usage |
| 95 |
pcretest description of the pcretest testing command |
pcretest description of the pcretest testing command |
| 96 |
|
|
| 97 |
In addition, in the "man" and HTML formats, there is a short page for |
In addition, in the "man" and HTML formats, there is a short page for |
| 98 |
each C library function, listing its arguments and results. |
each C library function, listing its arguments and results. |
| 99 |
|
|
| 100 |
|
|
| 101 |
LIMITATIONS |
LIMITATIONS |
| 102 |
|
|
| 103 |
There are some size limitations in PCRE but it is hoped that they will |
There are some size limitations in PCRE but it is hoped that they will |
| 104 |
never in practice be relevant. |
never in practice be relevant. |
| 105 |
|
|
| 106 |
The maximum length of a compiled pattern is 65539 (sic) bytes if PCRE |
The maximum length of a compiled pattern is 65539 (sic) bytes if PCRE |
| 107 |
is compiled with the default internal linkage size of 2. If you want to |
is compiled with the default internal linkage size of 2. If you want to |
| 108 |
process regular expressions that are truly enormous, you can compile |
process regular expressions that are truly enormous, you can compile |
| 109 |
PCRE with an internal linkage size of 3 or 4 (see the README file in |
PCRE with an internal linkage size of 3 or 4 (see the README file in |
| 110 |
the source distribution and the pcrebuild documentation for details). |
the source distribution and the pcrebuild documentation for details). |
| 111 |
In these cases the limit is substantially larger. However, the speed |
In these cases the limit is substantially larger. However, the speed |
| 112 |
of execution is slower. |
of execution is slower. |
| 113 |
|
|
| 114 |
All values in repeating quantifiers must be less than 65536. |
All values in repeating quantifiers must be less than 65536. |
| 119 |
The maximum length of name for a named subpattern is 32 characters, and |
The maximum length of name for a named subpattern is 32 characters, and |
| 120 |
the maximum number of named subpatterns is 10000. |
the maximum number of named subpatterns is 10000. |
| 121 |
|
|
| 122 |
The maximum length of a subject string is the largest positive number |
The maximum length of a subject string is the largest positive number |
| 123 |
that an integer variable can hold. However, when using the traditional |
that an integer variable can hold. However, when using the traditional |
| 124 |
matching function, PCRE uses recursion to handle subpatterns and indef- |
matching function, PCRE uses recursion to handle subpatterns and indef- |
| 125 |
inite repetition. This means that the available stack space may limit |
inite repetition. This means that the available stack space may limit |
| 126 |
the size of a subject string that can be processed by certain patterns. |
the size of a subject string that can be processed by certain patterns. |
| 127 |
For a discussion of stack issues, see the pcrestack documentation. |
For a discussion of stack issues, see the pcrestack documentation. |
| 128 |
|
|
| 129 |
|
|
| 130 |
UTF-8 AND UNICODE PROPERTY SUPPORT |
UTF-8 AND UNICODE PROPERTY SUPPORT |
| 131 |
|
|
| 132 |
From release 3.3, PCRE has had some support for character strings |
From release 3.3, PCRE has had some support for character strings |
| 133 |
encoded in the UTF-8 format. For release 4.0 this was greatly extended |
encoded in the UTF-8 format. For release 4.0 this was greatly extended |
| 134 |
to cover most common requirements, and in release 5.0 additional sup- |
to cover most common requirements, and in release 5.0 additional sup- |
| 135 |
port for Unicode general category properties was added. |
port for Unicode general category properties was added. |
| 136 |
|
|
| 137 |
In order to process UTF-8 strings, you must build PCRE to include UTF-8 |
In order to process UTF-8 strings, you must build PCRE to include UTF-8 |
| 138 |
support in the code, and, in addition, you must call pcre_compile() |
support in the code, and, in addition, you must call pcre_compile() |
| 139 |
with the PCRE_UTF8 option flag. When you do this, both the pattern and |
with the PCRE_UTF8 option flag. When you do this, both the pattern and |
| 140 |
any subject strings that are matched against it are treated as UTF-8 |
any subject strings that are matched against it are treated as UTF-8 |
| 141 |
strings instead of just strings of bytes. |
strings instead of just strings of bytes. |
| 142 |
|
|
| 143 |
If you compile PCRE with UTF-8 support, but do not use it at run time, |
If you compile PCRE with UTF-8 support, but do not use it at run time, |
| 144 |
the library will be a bit bigger, but the additional run time overhead |
the library will be a bit bigger, but the additional run time overhead |
| 145 |
is limited to testing the PCRE_UTF8 flag occasionally, so should not be |
is limited to testing the PCRE_UTF8 flag occasionally, so should not be |
| 146 |
very big. |
very big. |
| 147 |
|
|
| 148 |
If PCRE is built with Unicode character property support (which implies |
If PCRE is built with Unicode character property support (which implies |
| 149 |
UTF-8 support), the escape sequences \p{..}, \P{..}, and \X are sup- |
UTF-8 support), the escape sequences \p{..}, \P{..}, and \X are sup- |
| 150 |
ported. The available properties that can be tested are limited to the |
ported. The available properties that can be tested are limited to the |
| 151 |
general category properties such as Lu for an upper case letter or Nd |
general category properties such as Lu for an upper case letter or Nd |
| 152 |
for a decimal number, the Unicode script names such as Arabic or Han, |
for a decimal number, the Unicode script names such as Arabic or Han, |
| 153 |
and the derived properties Any and L&. A full list is given in the |
and the derived properties Any and L&. A full list is given in the |
| 154 |
pcrepattern documentation. Only the short names for properties are sup- |
pcrepattern documentation. Only the short names for properties are sup- |
| 155 |
ported. For example, \p{L} matches a letter. Its Perl synonym, \p{Let- |
ported. For example, \p{L} matches a letter. Its Perl synonym, \p{Let- |
| 156 |
ter}, is not supported. Furthermore, in Perl, many properties may |
ter}, is not supported. Furthermore, in Perl, many properties may |
| 157 |
optionally be prefixed by "Is", for compatibility with Perl 5.6. PCRE |
optionally be prefixed by "Is", for compatibility with Perl 5.6. PCRE |
| 158 |
does not support this. |
does not support this. |
| 159 |
|
|
| 160 |
Validity of UTF-8 strings |
Validity of UTF-8 strings |
| 161 |
|
|
| 162 |
When you set the PCRE_UTF8 flag, the strings passed as patterns and |
When you set the PCRE_UTF8 flag, the strings passed as patterns and |
| 163 |
subjects are (by default) checked for validity on entry to the relevant |
subjects are (by default) checked for validity on entry to the relevant |
| 164 |
functions. From release 7.3 of PCRE, the check is according to the rules |
functions. From release 7.3 of PCRE, the check is according to the rules |
| 165 |
of RFC 3629, which are themselves derived from the Unicode specifica- |
of RFC 3629, which are themselves derived from the Unicode specifica- |
| 166 |
tion. Earlier releases of PCRE followed the rules of RFC 2279, which |
tion. Earlier releases of PCRE followed the rules of RFC 2279, which |
| 167 |
allows the full range of 31-bit values (0 to 0x7FFFFFFF). The current |
allows the full range of 31-bit values (0 to 0x7FFFFFFF). The current |
| 168 |
check allows only values in the range U+0 to U+10FFFF, excluding U+D800 |
check allows only values in the range U+0 to U+10FFFF, excluding U+D800 |
| 169 |
to U+DFFF. |
to U+DFFF. |
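
The range rules above can be sketched in plain C. This is only an
illustration of an RFC 3629 style check, not PCRE's own validation
code; the function names sketch_valid_code_point and sketch_valid_utf8
are hypothetical, and the rejection of overlong encodings is taken
from RFC 3629 rather than from the text above.

```c
#include <stddef.h>

/* Returns 1 if cp lies in U+0 to U+10FFFF and outside U+D800 to U+DFFF. */
int sketch_valid_code_point(unsigned long cp)
{
    if (cp > 0x10FFFFUL) return 0;
    if (cp >= 0xD800UL && cp <= 0xDFFFUL) return 0;
    return 1;
}

/* Returns 1 if the 'length' bytes at 's' are well-formed UTF-8 under the
   RFC 3629 rules, otherwise 0. (Hypothetical helper, not a PCRE function.) */
int sketch_valid_utf8(const unsigned char *s, size_t length)
{
    size_t i = 0;

    while (i < length) {
        unsigned char c = s[i];
        unsigned long cp, min;
        size_t extra, k;

        if (c < 0x80) { i++; continue; }            /* single-byte character */
        else if ((c & 0xE0) == 0xC0) { extra = 1; cp = c & 0x1F; min = 0x80; }
        else if ((c & 0xF0) == 0xE0) { extra = 2; cp = c & 0x0F; min = 0x800; }
        else if ((c & 0xF8) == 0xF0) { extra = 3; cp = c & 0x07; min = 0x10000; }
        else return 0;       /* stray continuation byte, or 5/6-byte lead */

        if (extra > length - i - 1) return 0;       /* truncated sequence */

        for (k = 1; k <= extra; k++) {
            unsigned char cc = s[i + k];
            if ((cc & 0xC0) != 0x80) return 0;      /* not a continuation */
            cp = (cp << 6) | (cc & 0x3F);
        }

        if (cp < min) return 0;                     /* overlong encoding */
        if (!sketch_valid_code_point(cp)) return 0; /* surrogate or too big */
        i += extra + 1;
    }
    return 1;
}
```

A caller that intends to set PCRE_NO_UTF8_CHECK could run a check of
this kind once on its own data before handing strings to the library.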
| 170 |
|
|
| 171 |
The excluded code points are the "Low Surrogate Area" of Unicode, of |
The excluded code points are the "Low Surrogate Area" of Unicode, of |
| 172 |
which the Unicode Standard says this: "The Low Surrogate Area does not |
which the Unicode Standard says this: "The Low Surrogate Area does not |
| 173 |
contain any character assignments, consequently no character code |
contain any character assignments, consequently no character code |
| 174 |
charts or namelists are provided for this area. Surrogates are reserved |
charts or namelists are provided for this area. Surrogates are reserved |
| 175 |
for use with UTF-16 and then must be used in pairs." The code points |
for use with UTF-16 and then must be used in pairs." The code points |
| 176 |
that are encoded by UTF-16 pairs are available as independent code |
that are encoded by UTF-16 pairs are available as independent code |
| 177 |
points in the UTF-8 encoding. (In other words, the whole surrogate |
points in the UTF-8 encoding. (In other words, the whole surrogate |
| 178 |
thing is a fudge for UTF-16 which unfortunately messes up UTF-8.) |
thing is a fudge for UTF-16 which unfortunately messes up UTF-8.) |
| 179 |
|
|
| 180 |
If an invalid UTF-8 string is passed to PCRE, an error return |
If an invalid UTF-8 string is passed to PCRE, an error return |
| 181 |
(PCRE_ERROR_BADUTF8) is given. In some situations, you may already know |
(PCRE_ERROR_BADUTF8) is given. In some situations, you may already know |
| 182 |
that your strings are valid, and therefore want to skip these checks in |
that your strings are valid, and therefore want to skip these checks in |
| 183 |
order to improve performance. If you set the PCRE_NO_UTF8_CHECK flag at |
order to improve performance. If you set the PCRE_NO_UTF8_CHECK flag at |
| 184 |
compile time or at run time, PCRE assumes that the pattern or subject |
compile time or at run time, PCRE assumes that the pattern or subject |
| 185 |
it is given (respectively) contains only valid UTF-8 codes. In this |
it is given (respectively) contains only valid UTF-8 codes. In this |
| 186 |
case, it does not diagnose an invalid UTF-8 string. |
case, it does not diagnose an invalid UTF-8 string. |
| 187 |
|
|
| 188 |
If you pass an invalid UTF-8 string when PCRE_NO_UTF8_CHECK is set, |
If you pass an invalid UTF-8 string when PCRE_NO_UTF8_CHECK is set, |
| 189 |
what happens depends on why the string is invalid. If the string con- |
what happens depends on why the string is invalid. If the string con- |
| 190 |
forms to the "old" definition of UTF-8 (RFC 2279), it is processed as a |
forms to the "old" definition of UTF-8 (RFC 2279), it is processed as a |
| 191 |
string of characters in the range 0 to 0x7FFFFFFF. In other words, |
string of characters in the range 0 to 0x7FFFFFFF. In other words, |
| 192 |
apart from the initial validity test, PCRE (when in UTF-8 mode) handles |
apart from the initial validity test, PCRE (when in UTF-8 mode) handles |
| 193 |
strings according to the more liberal rules of RFC 2279. However, if |
strings according to the more liberal rules of RFC 2279. However, if |
| 194 |
the string does not even conform to RFC 2279, the result is undefined. |
the string does not even conform to RFC 2279, the result is undefined. |
| 195 |
Your program may crash. |
Your program may crash. |
| 196 |
|
|
| 197 |
If you want to process strings of values in the full range 0 to |
If you want to process strings of values in the full range 0 to |
| 198 |
0x7FFFFFFF, encoded in a UTF-8-like manner as per the old RFC, you can |
0x7FFFFFFF, encoded in a UTF-8-like manner as per the old RFC, you can |
| 199 |
set PCRE_NO_UTF8_CHECK to bypass the more restrictive test. However, in |
set PCRE_NO_UTF8_CHECK to bypass the more restrictive test. However, in |
| 200 |
this situation, you will have to apply your own validity check. |
this situation, you will have to apply your own validity check. |
| 201 |
|
|
| 202 |
General comments about UTF-8 mode |
General comments about UTF-8 mode |
| 203 |
|
|
| 204 |
1. An unbraced hexadecimal escape sequence (such as \xb3) matches a |
1. An unbraced hexadecimal escape sequence (such as \xb3) matches a |
| 205 |
two-byte UTF-8 character if the value is greater than 127. |
two-byte UTF-8 character if the value is greater than 127. |
| 206 |
|
|
| 207 |
2. Octal numbers up to \777 are recognized, and match two-byte UTF-8 |
2. Octal numbers up to \777 are recognized, and match two-byte UTF-8 |
| 208 |
characters for values greater than \177. |
characters for values greater than \177. |
| 209 |
|
|
| 210 |
3. Repeat quantifiers apply to complete UTF-8 characters, not to indi- |
3. Repeat quantifiers apply to complete UTF-8 characters, not to indi- |
| 211 |
vidual bytes, for example: \x{100}{3}. |
vidual bytes, for example: \x{100}{3}. |
| 212 |
|
|
| 213 |
4. The dot metacharacter matches one UTF-8 character instead of a sin- |
4. The dot metacharacter matches one UTF-8 character instead of a sin- |
| 214 |
gle byte. |
gle byte. |
| 215 |
|
|
| 216 |
5. The escape sequence \C can be used to match a single byte in UTF-8 |
5. The escape sequence \C can be used to match a single byte in UTF-8 |
| 217 |
mode, but its use can lead to some strange effects. This facility is |
mode, but its use can lead to some strange effects. This facility is |
| 218 |
not available in the alternative matching function, pcre_dfa_exec(). |
not available in the alternative matching function, pcre_dfa_exec(). |
| 219 |
|
|
| 220 |
6. The character escapes \b, \B, \d, \D, \s, \S, \w, and \W correctly |
6. The character escapes \b, \B, \d, \D, \s, \S, \w, and \W correctly |
| 221 |
test characters of any code value, but the characters that PCRE recog- |
test characters of any code value, but the characters that PCRE recog- |
| 222 |
nizes as digits, spaces, or word characters remain the same set as |
nizes as digits, spaces, or word characters remain the same set as |
| 223 |
before, all with values less than 256. This remains true even when PCRE |
before, all with values less than 256. This remains true even when PCRE |
| 224 |
includes Unicode property support, because to do otherwise would slow |
includes Unicode property support, because to do otherwise would slow |
| 225 |
down PCRE in many common cases. If you really want to test for a wider |
down PCRE in many common cases. If you really want to test for a wider |
| 226 |
sense of, say, "digit", you must use Unicode property tests such as |
sense of, say, "digit", you must use Unicode property tests such as |
| 227 |
\p{Nd}. |
\p{Nd}. |
| 228 |
|
|
| 229 |
7. Similarly, characters that match the POSIX named character classes |
7. Similarly, characters that match the POSIX named character classes |
| 230 |
are all low-valued characters. |
are all low-valued characters. |
| 231 |
|
|
| 232 |
8. However, the Perl 5.10 horizontal and vertical whitespace matching |
8. However, the Perl 5.10 horizontal and vertical whitespace matching |
| 233 |
escapes (\h, \H, \v, and \V) do match all the appropriate Unicode char- |
escapes (\h, \H, \v, and \V) do match all the appropriate Unicode char- |
| 234 |
acters. |
acters. |
| 235 |
|
|
| 236 |
9. Case-insensitive matching applies only to characters whose values |
9. Case-insensitive matching applies only to characters whose values |
| 237 |
are less than 128, unless PCRE is built with Unicode property support. |
are less than 128, unless PCRE is built with Unicode property support. |
| 238 |
Even when Unicode property support is available, PCRE still uses its |
Even when Unicode property support is available, PCRE still uses its |
| 239 |
own character tables when checking the case of low-valued characters, |
own character tables when checking the case of low-valued characters, |
| 240 |
so as not to degrade performance. The Unicode property information is |
so as not to degrade performance. The Unicode property information is |
| 241 |
used only for characters with higher values. Even when Unicode property |
used only for characters with higher values. Even when Unicode property |
| 242 |
support is available, PCRE supports case-insensitive matching only when |
support is available, PCRE supports case-insensitive matching only when |
| 243 |
there is a one-to-one mapping between a letter's cases. There are a |
there is a one-to-one mapping between a letter's cases. There are a |
| 244 |
small number of many-to-one mappings in Unicode; these are not sup- |
small number of many-to-one mappings in Unicode; these are not sup- |
| 245 |
ported by PCRE. |
ported by PCRE. |
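
To make points 1 and 2 concrete, the following sketch shows why values
such as \xb3 or \777 occupy two bytes in a UTF-8 subject. It is a
plain encoder written for illustration; the name sketch_utf8_encode is
hypothetical and the code is not taken from PCRE.

```c
#include <stddef.h>

/* Encodes code point cp as UTF-8 into out, which must have room for
   4 bytes. Returns the number of bytes written, or 0 if cp is a
   surrogate or is greater than 0x10FFFF. */
size_t sketch_utf8_encode(unsigned long cp, unsigned char *out)
{
    if (cp >= 0xD800UL && cp <= 0xDFFFUL) return 0;
    if (cp < 0x80UL) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800UL) {                 /* 128 to 2047: two bytes */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000UL) {               /* three bytes */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFFUL) {             /* four bytes */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

With this encoder, the value 0xb3 becomes the two bytes 0xC2 0xB3, and
0x100 (as in the \x{100}{3} example) becomes 0xC4 0x80, so a quantifier
that follows such a character has to cover both bytes.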
| 246 |
|
|
| 247 |
|
|
| 251 |
University Computing Service |
University Computing Service |
| 252 |
Cambridge CB2 3QH, England. |
Cambridge CB2 3QH, England. |
| 253 |
|
|
| 254 |
Putting an actual email address here seems to have been a spam magnet, |
Putting an actual email address here seems to have been a spam magnet, |
| 255 |
so I've taken it away. If you want to email me, use my two initials, |
so I've taken it away. If you want to email me, use my two initials, |
| 256 |
followed by the two digits 10, at the domain cam.ac.uk. |
followed by the two digits 10, at the domain cam.ac.uk. |
| 257 |
|
|
| 258 |
|
|
| 307 |
|
|
| 308 |
UTF-8 SUPPORT |
UTF-8 SUPPORT |
| 309 |
|
|
| 310 |
To build PCRE with support for UTF-8 character strings, add |
To build PCRE with support for UTF-8 Unicode character strings, add |
| 311 |
|
|
| 312 |
--enable-utf8 |
--enable-utf8 |
| 313 |
|
|
| 316 |
have to set the PCRE_UTF8 option when you call the pcre_compile() |
have to set the PCRE_UTF8 option when you call the pcre_compile() |
| 317 |
function. |
function. |
| 318 |
|
|
| 319 |
|
If you set --enable-utf8 when compiling in an EBCDIC environment, PCRE |
| 320 |
|
expects its input to be either ASCII or UTF-8 (depending on the runtime |
| 321 |
|
option). It is not possible to support both EBCDIC and UTF-8 codes in |
| 322 |
|
the same version of the library. Consequently, --enable-utf8 and |
| 323 |
|
--enable-ebcdic are mutually exclusive. |
| 324 |
|
|
| 325 |
|
|
| 326 |
UNICODE CHARACTER PROPERTY SUPPORT |
UNICODE CHARACTER PROPERTY SUPPORT |
| 327 |
|
|
| 343 |
|
|
| 344 |
CODE VALUE OF NEWLINE |
CODE VALUE OF NEWLINE |
| 345 |
|
|
| 346 |
By default, PCRE interprets character 10 (linefeed, LF) as indicating |
By default, PCRE interprets the linefeed (LF) character as indicating |
| 347 |
the end of a line. This is the normal newline character on Unix-like |
the end of a line. This is the normal newline character on Unix-like |
| 348 |
systems. You can compile PCRE to use character 13 (carriage return, CR) |
systems. You can compile PCRE to use carriage return (CR) instead, by |
| 349 |
instead, by adding |
adding |
| 350 |
|
|
| 351 |
--enable-newline-is-cr |
--enable-newline-is-cr |
| 352 |
|
|
| 369 |
|
|
| 370 |
causes PCRE to recognize any Unicode newline sequence. |
causes PCRE to recognize any Unicode newline sequence. |
| 371 |
|
|
| 372 |
Whatever line ending convention is selected when PCRE is built can be |
Whatever line ending convention is selected when PCRE is built can be |
| 373 |
overridden when the library functions are called. At build time it is |
overridden when the library functions are called. At build time it is |
| 374 |
conventional to use the standard for your operating system. |
conventional to use the standard for your operating system. |
| 375 |
|
|
| 376 |
|
|
| 377 |
WHAT \R MATCHES |
WHAT \R MATCHES |
| 378 |
|
|
| 379 |
By default, the sequence \R in a pattern matches any Unicode newline |
By default, the sequence \R in a pattern matches any Unicode newline |
| 380 |
sequence, whatever has been selected as the line ending sequence. If |
sequence, whatever has been selected as the line ending sequence. If |
| 381 |
you specify |
you specify |
| 382 |
|
|
| 383 |
--enable-bsr-anycrlf |
--enable-bsr-anycrlf |
| 384 |
|
|
| 385 |
the default is changed so that \R matches only CR, LF, or CRLF. What- |
the default is changed so that \R matches only CR, LF, or CRLF. What- |
| 386 |
ever is selected when PCRE is built can be overridden when the library |
ever is selected when PCRE is built can be overridden when the library |
| 387 |
functions are called. |
functions are called. |
| 388 |
|
|
| 389 |
|
|
| 390 |
BUILDING SHARED AND STATIC LIBRARIES |
BUILDING SHARED AND STATIC LIBRARIES |
| 391 |
|
|
| 392 |
The PCRE building process uses libtool to build both shared and static |
The PCRE building process uses libtool to build both shared and static |
| 393 |
Unix libraries by default. You can suppress one of these by adding one |
Unix libraries by default. You can suppress one of these by adding one |
| 394 |
of |
of |
| 395 |
|
|
| 396 |
--disable-shared |
--disable-shared |
| 402 |
POSIX MALLOC USAGE |
POSIX MALLOC USAGE |
| 403 |
|
|
| 404 |
When PCRE is called through the POSIX interface (see the pcreposix doc- |
When PCRE is called through the POSIX interface (see the pcreposix doc- |
| 405 |
umentation), additional working storage is required for holding the |
umentation), additional working storage is required for holding the |
| 406 |
pointers to capturing substrings, because PCRE requires three integers |
pointers to capturing substrings, because PCRE requires three integers |
| 407 |
per substring, whereas the POSIX interface provides only two. If the |
per substring, whereas the POSIX interface provides only two. If the |
| 408 |
number of expected substrings is small, the wrapper function uses space |
number of expected substrings is small, the wrapper function uses space |
| 409 |
on the stack, because this is faster than using malloc() for each call. |
on the stack, because this is faster than using malloc() for each call. |
| 410 |
The default threshold above which the stack is no longer used is 10; it |
The default threshold above which the stack is no longer used is 10; it |
| 417 |
|
|
| 418 |
HANDLING VERY LARGE PATTERNS |
HANDLING VERY LARGE PATTERNS |
| 419 |
|
|
| 420 |
Within a compiled pattern, offset values are used to point from one |
Within a compiled pattern, offset values are used to point from one |
| 421 |
part to another (for example, from an opening parenthesis to an alter- |
part to another (for example, from an opening parenthesis to an alter- |
| 422 |
nation metacharacter). By default, two-byte values are used for these |
nation metacharacter). By default, two-byte values are used for these |
| 423 |
offsets, leading to a maximum size for a compiled pattern of around |
offsets, leading to a maximum size for a compiled pattern of around |
| 424 |
64K. This is sufficient to handle all but the most gigantic patterns. |
64K. This is sufficient to handle all but the most gigantic patterns. |
| 425 |
Nevertheless, some people do want to process enormous patterns, so it |
Nevertheless, some people do want to process enormous patterns, so it |
| 426 |
is possible to compile PCRE to use three-byte or four-byte offsets by |
is possible to compile PCRE to use three-byte or four-byte offsets by |
| 427 |
adding a setting such as |
adding a setting such as |
| 428 |
|
|
| 429 |
--with-link-size=3 |
--with-link-size=3 |
| 430 |
|
|
| 431 |
to the configure command. The value given must be 2, 3, or 4. Using |
to the configure command. The value given must be 2, 3, or 4. Using |
| 432 |
longer offsets slows down the operation of PCRE because it has to load |
longer offsets slows down the operation of PCRE because it has to load |
| 433 |
additional bytes when handling them. |
additional bytes when handling them. |
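
The relationship between link size and addressable range can be
sketched as follows. The helper names are hypothetical, and the byte
order used by sketch_put_offset and sketch_get_offset (most
significant byte first) is an assumption made for the illustration; it
is not a statement about PCRE's internal layout.

```c
/* Largest value that fits in an offset of link_size bytes.
   Returns 0 for sizes other than 2, 3, or 4. */
unsigned long sketch_max_offset(int link_size)
{
    if (link_size < 2 || link_size > 4) return 0;
    return 0xFFFFFFFFUL >> (8 * (4 - link_size));
}

/* Stores value in link_size bytes at p, most significant byte first. */
void sketch_put_offset(unsigned char *p, unsigned long value, int link_size)
{
    int i;
    for (i = link_size - 1; i >= 0; i--) {
        p[i] = (unsigned char)(value & 0xFF);
        value >>= 8;
    }
}

/* Loads an offset back; one extra byte is read per extra unit of size. */
unsigned long sketch_get_offset(const unsigned char *p, int link_size)
{
    unsigned long value = 0;
    int i;
    for (i = 0; i < link_size; i++)
        value = (value << 8) | p[i];
    return value;
}
```

A two-byte offset tops out at 65535, which is where the limit of
around 64K comes from; three and four bytes raise the ceiling to
16777215 and 4294967295 at the cost of the additional loads shown in
sketch_get_offset.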
| 434 |
|
|
| 435 |
|
|
| 436 |
AVOIDING EXCESSIVE STACK USAGE |
AVOIDING EXCESSIVE STACK USAGE |
| 437 |
|
|
| 438 |
When matching with the pcre_exec() function, PCRE implements backtrack- |
When matching with the pcre_exec() function, PCRE implements backtrack- |
| 439 |
ing by making recursive calls to an internal function called match(). |
ing by making recursive calls to an internal function called match(). |
| 440 |
In environments where the size of the stack is limited, this can se- |
In environments where the size of the stack is limited, this can se- |
| 441 |
verely limit PCRE's operation. (The Unix environment does not usually |
verely limit PCRE's operation. (The Unix environment does not usually |
| 442 |
suffer from this problem, but it may sometimes be necessary to increase |
suffer from this problem, but it may sometimes be necessary to increase |
| 443 |
the maximum stack size. There is a discussion in the pcrestack docu- |
the maximum stack size. There is a discussion in the pcrestack docu- |
| 444 |
mentation.) An alternative approach to recursion that uses memory from |
mentation.) An alternative approach to recursion that uses memory from |
| 445 |
the heap to remember data, instead of using recursive function calls, |
the heap to remember data, instead of using recursive function calls, |
| 446 |
has been implemented to work round the problem of limited stack size. |
has been implemented to work round the problem of limited stack size. |
| 447 |
If you want to build a version of PCRE that works this way, add |
If you want to build a version of PCRE that works this way, add |
| 448 |
|
|
| 449 |
--disable-stack-for-recursion |
--disable-stack-for-recursion |
| 450 |
|
|
| 451 |
to the configure command. With this configuration, PCRE will use the |
to the configure command. With this configuration, PCRE will use the |
| 452 |
pcre_stack_malloc and pcre_stack_free variables to call memory manage- |
pcre_stack_malloc and pcre_stack_free variables to call memory manage- |
| 453 |
ment functions. By default these point to malloc() and free(), but you |
ment functions. By default these point to malloc() and free(), but you |
| 454 |
can replace the pointers so that your own functions are used. |
can replace the pointers so that your own functions are used. |
| 455 |
|
|
| 456 |
Separate functions are provided rather than using pcre_malloc and |
Separate functions are provided rather than using pcre_malloc and |
| 457 |
pcre_free because the usage is very predictable: the block sizes |
pcre_free because the usage is very predictable: the block sizes |
| 458 |
requested are always the same, and the blocks are always freed in |
requested are always the same, and the blocks are always freed in |
| 459 |
reverse order. A calling program might be able to implement optimized |
reverse order. A calling program might be able to implement optimized |
| 460 |
functions that perform better than malloc() and free(). PCRE runs |
functions that perform better than malloc() and free(). PCRE runs |
| 461 |
noticeably more slowly when built in this way. This option affects only |
noticeably more slowly when built in this way. This option affects only |
| 462 |
the pcre_exec() function; it is not relevant for the |
the pcre_exec() function; it is not relevant for the |
| 463 |
pcre_dfa_exec() function. |
pcre_dfa_exec() function. |
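
Because the blocks are of one size and are freed in reverse order, a
replacement pair of functions can simply keep freed blocks on a list
and hand them out again. The following is a minimal sketch under
those assumptions; all the sketch_ names are hypothetical, the code
requires C11 for max_align_t, and it is not thread-safe.

```c
#include <stddef.h>
#include <stdlib.h>

/* Header placed in front of each block. The union keeps the area that
   follows it suitably aligned for any object. */
typedef union sketch_header {
    struct {
        size_t size;                  /* usable size of the block */
        union sketch_header *next;    /* link while on the free list */
    } info;
    max_align_t align;
} sketch_header;

static sketch_header *sketch_free_list = NULL;

/* Same shape as malloc(): reuses the most recently freed block when it
   is large enough, otherwise falls back to malloc(). */
void *sketch_stack_malloc(size_t size)
{
    sketch_header *h = sketch_free_list;

    if (h != NULL && h->info.size >= size) {
        sketch_free_list = h->info.next;
    } else {
        h = malloc(sizeof(sketch_header) + size);
        if (h == NULL) return NULL;
        h->info.size = size;
    }
    return h + 1;
}

/* Same shape as free(): keeps the block for reuse instead of releasing it. */
void sketch_stack_free(void *block)
{
    sketch_header *h;

    if (block == NULL) return;
    h = (sketch_header *)block - 1;
    h->info.next = sketch_free_list;
    sketch_free_list = h;
}

/* Returns all cached blocks to the system, for example at shutdown. */
void sketch_stack_release(void)
{
    while (sketch_free_list != NULL) {
        sketch_header *h = sketch_free_list;
        sketch_free_list = h->info.next;
        free(h);
    }
}
```

In a program linked against a PCRE library built with
--disable-stack-for-recursion, the two pointers would be replaced with
assignments such as pcre_stack_malloc = sketch_stack_malloc and
pcre_stack_free = sketch_stack_free before any matching is done.
Whether this actually outperforms malloc() and free() depends on the
C library in use and should be measured.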
| 464 |
|
|
| 465 |
|
|
| 466 |
LIMITING PCRE RESOURCE USAGE |
LIMITING PCRE RESOURCE USAGE |
| 467 |
|
|
| 468 |
Internally, PCRE has a function called match(), which it calls repeat- |
Internally, PCRE has a function called match(), which it calls repeat- |
| 469 |
edly (sometimes recursively) when matching a pattern with the |
edly (sometimes recursively) when matching a pattern with the |
| 470 |
pcre_exec() function. By controlling the maximum number of times this |
pcre_exec() function. By controlling the maximum number of times this |
| 471 |
function may be called during a single matching operation, a limit can |
function may be called during a single matching operation, a limit can |
| 472 |
be placed on the resources used by a single call to pcre_exec(). The |
be placed on the resources used by a single call to pcre_exec(). The |
| 473 |
limit can be changed at run time, as described in the pcreapi documen- |
limit can be changed at run time, as described in the pcreapi documen- |
| 474 |
tation. The default is 10 million, but this can be changed by adding a |
tation. The default is 10 million, but this can be changed by adding a |
| 475 |
setting such as |
setting such as |
| 476 |
|
|
| 477 |
--with-match-limit=500000 |
--with-match-limit=500000 |
| 478 |
|
|
| 479 |
to the configure command. This setting has no effect on the |
to the configure command. This setting has no effect on the |
| 480 |
pcre_dfa_exec() matching function. |
pcre_dfa_exec() matching function. |
| 481 |
|
|
| 482 |
In some environments it is desirable to limit the depth of recursive |
In some environments it is desirable to limit the depth of recursive |
| 483 |
calls of match() more strictly than the total number of calls, in order |
calls of match() more strictly than the total number of calls, in order |
| 484 |
to restrict the maximum amount of stack (or heap, if --disable-stack- |
to restrict the maximum amount of stack (or heap, if --disable-stack- |
| 485 |
for-recursion is specified) that is used. A second limit controls this; |
for-recursion is specified) that is used. A second limit controls this; |
| 486 |
it defaults to the value that is set for --with-match-limit, which |
it defaults to the value that is set for --with-match-limit, which |
| 487 |
imposes no additional constraints. However, you can set a lower limit |
imposes no additional constraints. However, you can set a lower limit |
| 488 |
by adding, for example, |
by adding, for example, |
| 489 |
|
|
| 490 |
--with-match-limit-recursion=10000 |
--with-match-limit-recursion=10000 |
| 491 |
|
|
| 492 |
to the configure command. This value can also be overridden at run |
to the configure command. This value can also be overridden at run |
| 493 |
time. |
time. |
| 494 |
|
|
| 495 |
|
|
| 496 |
CREATING CHARACTER TABLES AT BUILD TIME |
CREATING CHARACTER TABLES AT BUILD TIME |
| 497 |
|
|
| 498 |
PCRE uses fixed tables for processing characters whose code values are |
PCRE uses fixed tables for processing characters whose code values are |
| 499 |
less than 256. By default, PCRE is built with a set of tables that are |
less than 256. By default, PCRE is built with a set of tables that are |
| 500 |
distributed in the file pcre_chartables.c.dist. These tables are for |
distributed in the file pcre_chartables.c.dist. These tables are for |
| 501 |
ASCII codes only. If you add |
ASCII codes only. If you add |
| 502 |
|
|
| 503 |
--enable-rebuild-chartables |
--enable-rebuild-chartables |
| 504 |
|
|
| 505 |
to the configure command, the distributed tables are no longer used. |
to the configure command, the distributed tables are no longer used. |
| 506 |
Instead, a program called dftables is compiled and run. This outputs |
Instead, a program called dftables is compiled and run. This outputs |
| 507 |
the source for a new set of tables, created in the default locale of your |
the source for a new set of tables, created in the default locale of your |
| 508 |
C runtime system. (This method of replacing the tables does not work if |
C runtime system. (This method of replacing the tables does not work if |
| 509 |
you are cross compiling, because dftables is run on the local host. If |
you are cross compiling, because dftables is run on the local host. If |
| 510 |
you need to create alternative tables when cross compiling, you will |
you need to create alternative tables when cross compiling, you will |
| 511 |
have to do so "by hand".) |
have to do so "by hand".) |
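
The idea of deriving tables from the C runtime's locale can be
sketched with the standard <ctype.h> functions. This is an
illustration only: the function name and the two tables are
hypothetical, and the layout of the tables that dftables really writes
is not reproduced here.

```c
#include <ctype.h>

#define SKETCH_TABLE_SIZE 256

/* Fills two 256-entry tables from the currently active C locale:
   lower[c] is the lower-case form of c, and is_word[c] is 1 for
   letters, digits, and underscore, otherwise 0. */
void sketch_build_tables(unsigned char *lower, unsigned char *is_word)
{
    int c;

    for (c = 0; c < SKETCH_TABLE_SIZE; c++) {
        lower[c] = (unsigned char)tolower(c);
        is_word[c] = (unsigned char)(isalnum(c) || c == '_');
    }
}
```

The results depend on the locale in force when the function runs,
which is why tables built this way on a local host may not suit a
different target system.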
| 512 |
|
|
| 513 |
|
|
| 514 |
USING EBCDIC CODE |
USING EBCDIC CODE |
| 515 |
|
|
| 516 |
PCRE assumes by default that it will run in an environment where the |
PCRE assumes by default that it will run in an environment where the |
| 517 |
character code is ASCII (or Unicode, which is a superset of ASCII). |
character code is ASCII (or Unicode, which is a superset of ASCII). |
| 518 |
This is the case for most computer operating systems. PCRE can, how- |
This is the case for most computer operating systems. PCRE can, how- |
| 519 |
ever, be compiled to run in an EBCDIC environment by adding |
ever, be compiled to run in an EBCDIC environment by adding |
| 520 |
|
|
| 521 |
--enable-ebcdic |
--enable-ebcdic |
| 522 |
|
|
| 523 |
to the configure command. This setting implies --enable-rebuild-charta- |
to the configure command. This setting implies --enable-rebuild-charta- |
| 524 |
bles. You should only use it if you know that you are in an EBCDIC |
bles. You should only use it if you know that you are in an EBCDIC |
| 525 |
environment (for example, an IBM mainframe operating system). |
environment (for example, an IBM mainframe operating system). The |
| 526 |
|
--enable-ebcdic option is incompatible with --enable-utf8. |
| 527 |
|
|
| 528 |
|
|
| 529 |
PCREGREP OPTIONS FOR COMPRESSED FILE SUPPORT |
PCREGREP OPTIONS FOR COMPRESSED FILE SUPPORT |
| 585 |
|
|
| 586 |
REVISION |
REVISION |
| 587 |
|
|
| 588 |
Last updated: 13 April 2008 |
Last updated: 17 March 2009 |
| 589 |
Copyright (c) 1997-2008 University of Cambridge. |
Copyright (c) 1997-2009 University of Cambridge. |
| 590 |
------------------------------------------------------------------------------ |
------------------------------------------------------------------------------ |
| 591 |
|
|
| 592 |
|
|
| 1006 |
pcre_malloc, pcre_free, pcre_stack_malloc, and pcre_stack_free, and the |
pcre_malloc, pcre_free, pcre_stack_malloc, and pcre_stack_free, and the |
| 1007 |
callout function pointed to by pcre_callout, are shared by all threads. |
callout function pointed to by pcre_callout, are shared by all threads. |
| 1008 |
|
|
| 1009 |
The compiled form of a regular expression is not altered during match- |
The compiled form of a regular expression is not altered during match- |
| 1010 |
ing, so the same compiled pattern can safely be used by several threads |
ing, so the same compiled pattern can safely be used by several threads |
| 1011 |
at once. |
at once. |
| 1012 |
|
|
| 1014 |
SAVING PRECOMPILED PATTERNS FOR LATER USE |
SAVING PRECOMPILED PATTERNS FOR LATER USE |
| 1015 |
|
|
| 1016 |
The compiled form of a regular expression can be saved and re-used at a |
The compiled form of a regular expression can be saved and re-used at a |
| 1017 |
later time, possibly by a different program, and even on a host other |
later time, possibly by a different program, and even on a host other |
| 1018 |
than the one on which it was compiled. Details are given in the |
than the one on which it was compiled. Details are given in the |
| 1019 |
pcreprecompile documentation. However, compiling a regular expression |
pcreprecompile documentation. However, compiling a regular expression |
| 1020 |
with one version of PCRE for use with a different version is not guar- |
with one version of PCRE for use with a different version is not guar- |
| 1021 |
anteed to work and may cause crashes. |
anteed to work and may cause crashes. |
| 1022 |
|
|
| 1023 |
|
|
| 1025 |
|
|
| 1026 |
int pcre_config(int what, void *where); |
int pcre_config(int what, void *where); |
| 1027 |
|
|
| 1028 |
The function pcre_config() makes it possible for a PCRE client to dis- |
The function pcre_config() makes it possible for a PCRE client to dis- |
| 1029 |
cover which optional features have been compiled into the PCRE library. |
cover which optional features have been compiled into the PCRE library. |
| 1030 |
The pcrebuild documentation has more details about these optional fea- |
The pcrebuild documentation has more details about these optional fea- |
| 1031 |
tures. |
tures. |
| 1032 |
|
|
| 1033 |
The first argument for pcre_config() is an integer, specifying which |
The first argument for pcre_config() is an integer, specifying which |
| 1034 |
information is required; the second argument is a pointer to a variable |
information is required; the second argument is a pointer to a variable |
| 1035 |
into which the information is placed. The following information is |
into which the information is placed. The following information is |
| 1036 |
available: |
available: |
| 1037 |
|
|
| 1038 |
PCRE_CONFIG_UTF8 |
PCRE_CONFIG_UTF8 |
| 1039 |
|
|
| 1040 |
The output is an integer that is set to one if UTF-8 support is avail- |
The output is an integer that is set to one if UTF-8 support is avail- |
| 1041 |
able; otherwise it is set to zero. |
able; otherwise it is set to zero. |
| 1042 |
|
|
| 1043 |
PCRE_CONFIG_UNICODE_PROPERTIES |
PCRE_CONFIG_UNICODE_PROPERTIES |
| 1044 |
|
|
| 1045 |
The output is an integer that is set to one if support for Unicode |
The output is an integer that is set to one if support for Unicode |
| 1046 |
character properties is available; otherwise it is set to zero. |
character properties is available; otherwise it is set to zero. |
| 1047 |
|
|
| 1048 |
PCRE_CONFIG_NEWLINE |
PCRE_CONFIG_NEWLINE |
| 1049 |
|
|
| 1050 |
The output is an integer whose value specifies the default character |
The output is an integer whose value specifies the default character |
| 1051 |
sequence that is recognized as meaning "newline". The five values that |
sequence that is recognized as meaning "newline". The five values that |
| 1052 |
are supported are: 10 for LF, 13 for CR, 3338 for CRLF, -2 for ANYCRLF, |
are supported are: 10 for LF, 13 for CR, 3338 for CRLF, -2 for ANYCRLF, |
| 1053 |
and -1 for ANY. The default should normally be the standard sequence |
and -1 for ANY. Though they are derived from ASCII, the same values |
| 1054 |
for your operating system. |
are returned in EBCDIC environments. The default should normally corre- |
| 1055 |
|
spond to the standard sequence for your operating system. |
| 1056 |
|
|
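The PCRE_CONFIG_NEWLINE values listed above can be decoded with a small
lookup. The following is only a sketch: describe_newline() is a
hypothetical helper, not a PCRE function, and the value passed to it is
assumed to be the integer that pcre_config() stored for
PCRE_CONFIG_NEWLINE. Note that 3338 is simply 13 * 256 + 10, that is,
CR followed by LF.

```c
#include <string.h>

/* Hypothetical helper (not part of PCRE): map the integer delivered for
   PCRE_CONFIG_NEWLINE to a readable name. */
const char *describe_newline(int value)
{
    switch (value) {
    case 10:   return "LF";
    case 13:   return "CR";
    case 3338: return "CRLF";      /* 13 * 256 + 10 */
    case -2:   return "ANYCRLF";
    case -1:   return "ANY";
    default:   return "unknown";   /* not one of the documented values */
    }
}
```

A caller would typically pass the result to a diagnostic message when
reporting how the library was built.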
| 1057 |
PCRE_CONFIG_BSR |
PCRE_CONFIG_BSR |
| 1058 |
|
|
| 1079 |
|
|
| 1080 |
PCRE_CONFIG_MATCH_LIMIT |
PCRE_CONFIG_MATCH_LIMIT |
| 1081 |
|
|
| 1082 |
The output is an integer that gives the default limit for the number of |
The output is a long integer that gives the default limit for the num- |
| 1083 |
internal matching function calls in a pcre_exec() execution. Further |
ber of internal matching function calls in a pcre_exec() execution. |
| 1084 |
details are given with pcre_exec() below. |
Further details are given with pcre_exec() below. |
| 1085 |
|
|
| 1086 |
PCRE_CONFIG_MATCH_LIMIT_RECURSION |
PCRE_CONFIG_MATCH_LIMIT_RECURSION |
| 1087 |
|
|
| 1088 |
The output is an integer that gives the default limit for the depth of |
The output is a long integer that gives the default limit for the depth |
| 1089 |
recursion when calling the internal matching function in a pcre_exec() |
of recursion when calling the internal matching function in a |
| 1090 |
execution. Further details are given with pcre_exec() below. |
pcre_exec() execution. Further details are given with pcre_exec() |
| 1091 |
|
below. |
| 1092 |
|
|
| 1093 |
PCRE_CONFIG_STACKRECURSE |
PCRE_CONFIG_STACKRECURSE |
| 1094 |
|
|
| 1095 |
The output is an integer that is set to one if internal recursion when |
The output is an integer that is set to one if internal recursion when |
| 1096 |
running pcre_exec() is implemented by recursive function calls that use |
running pcre_exec() is implemented by recursive function calls that use |
| 1097 |
the stack to remember their state. This is the usual way that PCRE is |
the stack to remember their state. This is the usual way that PCRE is |
| 1098 |
compiled. The output is zero if PCRE was compiled to use blocks of data |
compiled. The output is zero if PCRE was compiled to use blocks of data |
| 1099 |
on the heap instead of recursive function calls. In this case, |
on the heap instead of recursive function calls. In this case, |
| 1100 |
pcre_stack_malloc and pcre_stack_free are called to manage memory |
pcre_stack_malloc and pcre_stack_free are called to manage memory |
| 1101 |
blocks on the heap, thus avoiding the use of the stack. |
blocks on the heap, thus avoiding the use of the stack. |
| 1102 |
|
|
| 1103 |
|
|
| 1114 |
|
|
| 1115 |
Either of the functions pcre_compile() or pcre_compile2() can be called |
Either of the functions pcre_compile() or pcre_compile2() can be called |
| 1116 |
to compile a pattern into an internal form. The only difference between |
to compile a pattern into an internal form. The only difference between |
| 1117 |
the two interfaces is that pcre_compile2() has an additional argument, |
the two interfaces is that pcre_compile2() has an additional argument, |
| 1118 |
errorcodeptr, via which a numerical error code can be returned. |
errorcodeptr, via which a numerical error code can be returned. |
| 1119 |
|
|
| 1120 |
The pattern is a C string terminated by a binary zero, and is passed in |
The pattern is a C string terminated by a binary zero, and is passed in |
| 1121 |
the pattern argument. A pointer to a single block of memory that is |
the pattern argument. A pointer to a single block of memory that is |
| 1122 |
obtained via pcre_malloc is returned. This contains the compiled code |
obtained via pcre_malloc is returned. This contains the compiled code |
| 1123 |
and related data. The pcre type is defined for the returned block; this |
and related data. The pcre type is defined for the returned block; this |
| 1124 |
is a typedef for a structure whose contents are not externally defined. |
is a typedef for a structure whose contents are not externally defined. |
| 1125 |
It is up to the caller to free the memory (via pcre_free) when it is no |
It is up to the caller to free the memory (via pcre_free) when it is no |
| 1126 |
longer required. |
longer required. |
| 1127 |
|
|
| 1128 |
Although the compiled code of a PCRE regex is relocatable, that is, it |
Although the compiled code of a PCRE regex is relocatable, that is, it |
| 1129 |
does not depend on memory location, the complete pcre data block is not |
does not depend on memory location, the complete pcre data block is not |
| 1130 |
fully relocatable, because it may contain a copy of the tableptr argu- |
fully relocatable, because it may contain a copy of the tableptr argu- |
| 1131 |
ment, which is an address (see below). |
ment, which is an address (see below). |
| 1132 |
|
|
| 1133 |
The options argument contains various bit settings that affect the com- |
The options argument contains various bit settings that affect the com- |
| 1134 |
pilation. It should be zero if no options are required. The available |
pilation. It should be zero if no options are required. The available |
| 1135 |
options are described below. Some of them, in particular, those that |
options are described below. Some of them, in particular, those that |
| 1136 |
are compatible with Perl, can also be set and unset from within the |
are compatible with Perl, can also be set and unset from within the |
| 1137 |
pattern (see the detailed description in the pcrepattern documenta- |
pattern (see the detailed description in the pcrepattern documenta- |
| 1138 |
tion). For these options, the contents of the options argument speci- |
tion). For these options, the contents of the options argument speci- |
| 1139 |
fies their initial settings at the start of compilation and execution. |
fies their initial settings at the start of compilation and execution. |
| 1140 |
The PCRE_ANCHORED and PCRE_NEWLINE_xxx options can be set at the time |
The PCRE_ANCHORED and PCRE_NEWLINE_xxx options can be set at the time |
| 1141 |
of matching as well as at compile time. |
of matching as well as at compile time. |
| 1142 |
|
|
| 1143 |
If errptr is NULL, pcre_compile() returns NULL immediately. Otherwise, |
If errptr is NULL, pcre_compile() returns NULL immediately. Otherwise, |
| 1144 |
if compilation of a pattern fails, pcre_compile() returns NULL, and |
if compilation of a pattern fails, pcre_compile() returns NULL, and |
| 1145 |
sets the variable pointed to by errptr to point to a textual error mes- |
sets the variable pointed to by errptr to point to a textual error mes- |
| 1146 |
sage. This is a static string that is part of the library. You must not |
sage. This is a static string that is part of the library. You must not |
| 1147 |
try to free it. The offset from the start of the pattern to the charac- |
try to free it. The offset from the start of the pattern to the charac- |
| 1148 |
ter where the error was discovered is placed in the variable pointed to |
ter where the error was discovered is placed in the variable pointed to |
| 1149 |
by erroffset, which must not be NULL. If it is, an immediate error is |
by erroffset, which must not be NULL. If it is, an immediate error is |
| 1150 |
given. |
given. |
| 1151 |
|
|
| 1152 |
If pcre_compile2() is used instead of pcre_compile(), and the error- |
If pcre_compile2() is used instead of pcre_compile(), and the error- |
| 1153 |
codeptr argument is not NULL, a non-zero error code number is returned |
codeptr argument is not NULL, a non-zero error code number is returned |
| 1154 |
via this argument in the event of an error. This is in addition to the |
via this argument in the event of an error. This is in addition to the |
| 1155 |
textual error message. Error codes and messages are listed below. |
textual error message. Error codes and messages are listed below. |
| 1156 |
|
|
| 1157 |
If the final argument, tableptr, is NULL, PCRE uses a default set of |
If the final argument, tableptr, is NULL, PCRE uses a default set of |
| 1158 |
character tables that are built when PCRE is compiled, using the |
character tables that are built when PCRE is compiled, using the |
| 1159 |
default C locale. Otherwise, tableptr must be an address that is the |
default C locale. Otherwise, tableptr must be an address that is the |
| 1160 |
result of a call to pcre_maketables(). This value is stored with the |
result of a call to pcre_maketables(). This value is stored with the |
| 1161 |
compiled pattern, and used again by pcre_exec(), unless another table |
compiled pattern, and used again by pcre_exec(), unless another table |
| 1162 |
pointer is passed to it. For more discussion, see the section on locale |
pointer is passed to it. For more discussion, see the section on locale |
| 1163 |
support below. |
support below. |
| 1164 |
|
|
| 1165 |
This code fragment shows a typical straightforward call to pcre_com- |
This code fragment shows a typical straightforward call to pcre_com- |
| 1166 |
pile(): |
pile(): |
| 1167 |
|
|
| 1168 |
pcre *re; |
pcre *re; |
| 1175 |
&erroffset, /* for error offset */ |
&erroffset, /* for error offset */ |
| 1176 |
NULL); /* use default character tables */ |
NULL); /* use default character tables */ |
| 1177 |
|
|
| 1178 |
The following names for option bits are defined in the pcre.h header |
The following names for option bits are defined in the pcre.h header |
| 1179 |
file: |
file: |
| 1180 |
|
|
| 1181 |
PCRE_ANCHORED |
PCRE_ANCHORED |
| 1182 |
|
|
| 1183 |
If this bit is set, the pattern is forced to be "anchored", that is, it |
If this bit is set, the pattern is forced to be "anchored", that is, it |
| 1184 |
is constrained to match only at the first matching point in the string |
is constrained to match only at the first matching point in the string |
| 1185 |
that is being searched (the "subject string"). This effect can also be |
that is being searched (the "subject string"). This effect can also be |
| 1186 |
achieved by appropriate constructs in the pattern itself, which is the |
achieved by appropriate constructs in the pattern itself, which is the |
| 1187 |
only way to do it in Perl. |
only way to do it in Perl. |
| 1188 |
|
|
| 1189 |
PCRE_AUTO_CALLOUT |
PCRE_AUTO_CALLOUT |
| 1190 |
|
|
| 1191 |
If this bit is set, pcre_compile() automatically inserts callout items, |
If this bit is set, pcre_compile() automatically inserts callout items, |
| 1192 |
all with number 255, before each pattern item. For discussion of the |
all with number 255, before each pattern item. For discussion of the |
| 1193 |
callout facility, see the pcrecallout documentation. |
callout facility, see the pcrecallout documentation. |
| 1194 |
|
|
| 1195 |
PCRE_BSR_ANYCRLF |
PCRE_BSR_ANYCRLF |
| 1196 |
PCRE_BSR_UNICODE |
PCRE_BSR_UNICODE |
| 1197 |
|
|
| 1198 |
These options (which are mutually exclusive) control what the \R escape |
These options (which are mutually exclusive) control what the \R escape |
| 1199 |
sequence matches. The choice is either to match only CR, LF, or CRLF, |
sequence matches. The choice is either to match only CR, LF, or CRLF, |
| 1200 |
or to match any Unicode newline sequence. The default is specified when |
or to match any Unicode newline sequence. The default is specified when |
| 1201 |
PCRE is built. It can be overridden from within the pattern, or by set- |
PCRE is built. It can be overridden from within the pattern, or by set- |
| 1202 |
ting an option when a compiled pattern is matched. |
ting an option when a compiled pattern is matched. |
| 1203 |
|
|
| 1204 |
PCRE_CASELESS |
PCRE_CASELESS |
| 1205 |
|
|
| 1206 |
If this bit is set, letters in the pattern match both upper and lower |
If this bit is set, letters in the pattern match both upper and lower |
| 1207 |
case letters. It is equivalent to Perl's /i option, and it can be |
case letters. It is equivalent to Perl's /i option, and it can be |
| 1208 |
changed within a pattern by a (?i) option setting. In UTF-8 mode, PCRE |
changed within a pattern by a (?i) option setting. In UTF-8 mode, PCRE |
| 1209 |
always understands the concept of case for characters whose values are |
always understands the concept of case for characters whose values are |
| 1210 |
less than 128, so caseless matching is always possible. For characters |
less than 128, so caseless matching is always possible. For characters |
| 1211 |
with higher values, the concept of case is supported if PCRE is com- |
with higher values, the concept of case is supported if PCRE is com- |
| 1212 |
piled with Unicode property support, but not otherwise. If you want to |
piled with Unicode property support, but not otherwise. If you want to |
| 1213 |
use caseless matching for characters 128 and above, you must ensure |
use caseless matching for characters 128 and above, you must ensure |
| 1214 |
that PCRE is compiled with Unicode property support as well as with |
that PCRE is compiled with Unicode property support as well as with |
| 1215 |
UTF-8 support. |
UTF-8 support. |
| 1216 |
|
|
| 1217 |
PCRE_DOLLAR_ENDONLY |
PCRE_DOLLAR_ENDONLY |
| 1218 |
|
|
| 1219 |
If this bit is set, a dollar metacharacter in the pattern matches only |
If this bit is set, a dollar metacharacter in the pattern matches only |
| 1220 |
at the end of the subject string. Without this option, a dollar also |
at the end of the subject string. Without this option, a dollar also |
| 1221 |
matches immediately before a newline at the end of the string (but not |
matches immediately before a newline at the end of the string (but not |
| 1222 |
before any other newlines). The PCRE_DOLLAR_ENDONLY option is ignored |
before any other newlines). The PCRE_DOLLAR_ENDONLY option is ignored |
| 1223 |
if PCRE_MULTILINE is set. There is no equivalent to this option in |
if PCRE_MULTILINE is set. There is no equivalent to this option in |
| 1224 |
Perl, and no way to set it within a pattern. |
Perl, and no way to set it within a pattern. |
| 1225 |
|
|
| 1226 |
PCRE_DOTALL |
PCRE_DOTALL |
| 1227 |
|
|
| 1228 |
If this bit is set, a dot metacharacter in the pattern matches all |
If this bit is set, a dot metacharacter in the pattern matches all |
| 1229 |
characters, including those that indicate newline. Without it, a dot does |
characters, including those that indicate newline. Without it, a dot does |
| 1230 |
not match when the current position is at a newline. This option is |
not match when the current position is at a newline. This option is |
| 1231 |
equivalent to Perl's /s option, and it can be changed within a pattern |
equivalent to Perl's /s option, and it can be changed within a pattern |
| 1232 |
by a (?s) option setting. A negative class such as [^a] always matches |
by a (?s) option setting. A negative class such as [^a] always matches |
| 1233 |
newline characters, independent of the setting of this option. |
newline characters, independent of the setting of this option. |
| 1234 |
|
|
| 1235 |
PCRE_DUPNAMES |
PCRE_DUPNAMES |
| 1236 |
|
|
| 1237 |
If this bit is set, names used to identify capturing subpatterns need |
If this bit is set, names used to identify capturing subpatterns need |
| 1238 |
not be unique. This can be helpful for certain types of pattern when it |
not be unique. This can be helpful for certain types of pattern when it |
| 1239 |
is known that only one instance of the named subpattern can ever be |
is known that only one instance of the named subpattern can ever be |
| 1240 |
matched. There are more details of named subpatterns below; see also |
matched. There are more details of named subpatterns below; see also |
| 1241 |
the pcrepattern documentation. |
the pcrepattern documentation. |
| 1242 |
|
|
| 1243 |
PCRE_EXTENDED |
PCRE_EXTENDED |
| 1244 |
|
|
| 1245 |
If this bit is set, whitespace data characters in the pattern are |
If this bit is set, whitespace data characters in the pattern are |
| 1246 |
totally ignored except when escaped or inside a character class. White- |
totally ignored except when escaped or inside a character class. White- |
| 1247 |
space does not include the VT character (code 11). In addition, charac- |
space does not include the VT character (code 11). In addition, charac- |
| 1248 |
ters between an unescaped # outside a character class and the next new- |
ters between an unescaped # outside a character class and the next new- |
| 1249 |
line, inclusive, are also ignored. This is equivalent to Perl's /x |
line, inclusive, are also ignored. This is equivalent to Perl's /x |
| 1250 |
option, and it can be changed within a pattern by a (?x) option set- |
option, and it can be changed within a pattern by a (?x) option set- |
| 1251 |
ting. |
ting. |
| 1252 |
|
|
| 1253 |
This option makes it possible to include comments inside complicated |
This option makes it possible to include comments inside complicated |
| 1254 |
patterns. Note, however, that this applies only to data characters. |
patterns. Note, however, that this applies only to data characters. |
| 1255 |
Whitespace characters may never appear within special character |
Whitespace characters may never appear within special character |
| 1256 |
sequences in a pattern, for example within the sequence (?( which |
sequences in a pattern, for example within the sequence (?( which |
| 1257 |
introduces a conditional subpattern. |
introduces a conditional subpattern. |
| 1258 |
|
|
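The PCRE_EXTENDED rules above can be modelled by a small filter that
copies a pattern while dropping what would be ignored. This is a rough
sketch only, not PCRE's own code: strip_extended() is a hypothetical
helper, newline is assumed to be LF, and \Q...\E sequences, POSIX
classes, and a literal ] at the start of a class are not handled.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy pattern to out, omitting the characters that
   PCRE_EXTENDED would ignore. out needs room for strlen(pattern) + 1
   bytes. Returns the length of the result. */
size_t strip_extended(const char *pattern, char *out)
{
    size_t n = 0;
    int in_class = 0;
    const char *p;

    for (p = pattern; *p != '\0'; p++) {
        if (*p == '\\' && p[1] != '\0') {   /* escaped item: keep both */
            out[n++] = *p++;
            out[n++] = *p;
            continue;
        }
        if (in_class) {                     /* inside [...]: keep as is */
            if (*p == ']') in_class = 0;
            out[n++] = *p;
            continue;
        }
        if (*p == '[') {
            in_class = 1;
            out[n++] = *p;
            continue;
        }
        /* Whitespace is dropped; VT (code 11) is deliberately excluded. */
        if (*p == ' ' || *p == '\t' || *p == '\n' || *p == '\f' || *p == '\r')
            continue;
        if (*p == '#') {                    /* comment runs to the newline */
            while (*p != '\0' && *p != '\n') p++;
            if (*p == '\0') break;
            continue;                       /* the newline is dropped too */
        }
        out[n++] = *p;
    }
    out[n] = '\0';
    return n;
}
```

Only data characters are filtered here; the sketch makes no attempt to
reject whitespace inside special sequences such as (?( .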
| 1259 |
PCRE_EXTRA |
PCRE_EXTRA |
| 1260 |
|
|
| 1261 |
This option was invented in order to turn on additional functionality |
This option was invented in order to turn on additional functionality |
| 1262 |
of PCRE that is incompatible with Perl, but it is currently of very |
of PCRE that is incompatible with Perl, but it is currently of very |
| 1263 |
little use. When set, any backslash in a pattern that is followed by a |
little use. When set, any backslash in a pattern that is followed by a |
| 1264 |
letter that has no special meaning causes an error, thus reserving |
letter that has no special meaning causes an error, thus reserving |
| 1265 |
these combinations for future expansion. By default, as in Perl, a |
these combinations for future expansion. By default, as in Perl, a |
| 1266 |
backslash followed by a letter with no special meaning is treated as a |
backslash followed by a letter with no special meaning is treated as a |
| 1267 |
literal. (Perl can, however, be persuaded to give a warning for this.) |
literal. (Perl can, however, be persuaded to give a warning for this.) |
| 1268 |
There are at present no other features controlled by this option. It |
There are at present no other features controlled by this option. It |
| 1269 |
can also be set by a (?X) option setting within a pattern. |
can also be set by a (?X) option setting within a pattern. |
| 1270 |
|
|
| 1271 |
PCRE_FIRSTLINE |
PCRE_FIRSTLINE |
| 1272 |
|
|
| 1273 |
If this option is set, an unanchored pattern is required to match |
If this option is set, an unanchored pattern is required to match |
| 1274 |
before or at the first newline in the subject string, though the |
before or at the first newline in the subject string, though the |
| 1275 |
matched text may continue over the newline. |
matched text may continue over the newline. |
| 1276 |
|
|
| 1277 |
PCRE_JAVASCRIPT_COMPAT |
PCRE_JAVASCRIPT_COMPAT |
| 1278 |
|
|
| 1279 |
If this option is set, PCRE's behaviour is changed in some ways so that |
If this option is set, PCRE's behaviour is changed in some ways so that |
| 1280 |
it is compatible with JavaScript rather than Perl. The changes are as |
it is compatible with JavaScript rather than Perl. The changes are as |
| 1281 |
follows: |
follows: |
| 1282 |
|
|
| 1283 |
(1) A lone closing square bracket in a pattern causes a compile-time |
(1) A lone closing square bracket in a pattern causes a compile-time |
| 1284 |
error, because this is illegal in JavaScript (by default it is treated |
error, because this is illegal in JavaScript (by default it is treated |
| 1285 |
as a data character). Thus, the pattern AB]CD becomes illegal when this |
as a data character). Thus, the pattern AB]CD becomes illegal when this |
| 1286 |
option is set. |
option is set. |
| 1287 |
|
|
| 1288 |
(2) At run time, a back reference to an unset subpattern group matches |
(2) At run time, a back reference to an unset subpattern group matches |
| 1289 |
an empty string (by default this causes the current matching alterna- |
an empty string (by default this causes the current matching alterna- |
| 1290 |
tive to fail). A pattern such as (\1)(a) succeeds when this option is |
tive to fail). A pattern such as (\1)(a) succeeds when this option is |
| 1291 |
set (assuming it can find an "a" in the subject), whereas it fails by |
set (assuming it can find an "a" in the subject), whereas it fails by |
| 1292 |
default, for Perl compatibility. |
default, for Perl compatibility. |
| 1293 |
|
|
| 1294 |
PCRE_MULTILINE |
PCRE_MULTILINE |
| 1295 |
|
|
| 1296 |
By default, PCRE treats the subject string as consisting of a single |
By default, PCRE treats the subject string as consisting of a single |
| 1297 |
line of characters (even if it actually contains newlines). The "start |
line of characters (even if it actually contains newlines). The "start |
| 1298 |
of line" metacharacter (^) matches only at the start of the string, |
of line" metacharacter (^) matches only at the start of the string, |
| 1299 |
while the "end of line" metacharacter ($) matches only at the end of |
while the "end of line" metacharacter ($) matches only at the end of |
| 1300 |
the string, or before a terminating newline (unless PCRE_DOLLAR_ENDONLY |
the string, or before a terminating newline (unless PCRE_DOLLAR_ENDONLY |
| 1301 |
is set). This is the same as Perl. |
is set). This is the same as Perl. |
| 1302 |
|
|
| 1303 |
When PCRE_MULTILINE is set, the "start of line" and "end of line" |
When PCRE_MULTILINE is set, the "start of line" and "end of line" |
| 1304 |
constructs match immediately following or immediately before internal |
constructs match immediately following or immediately before internal |
| 1305 |
newlines in the subject string, respectively, as well as at the very |
newlines in the subject string, respectively, as well as at the very |
| 1306 |
start and end. This is equivalent to Perl's /m option, and it can be |
start and end. This is equivalent to Perl's /m option, and it can be |
| 1307 |
changed within a pattern by a (?m) option setting. If there are no new- |
changed within a pattern by a (?m) option setting. If there are no new- |
| 1308 |
lines in a subject string, or no occurrences of ^ or $ in a pattern, |
lines in a subject string, or no occurrences of ^ or $ in a pattern, |
| 1309 |
setting PCRE_MULTILINE has no effect. |
setting PCRE_MULTILINE has no effect. |
| 1310 |
|
|
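The interaction of the dollar metacharacter with PCRE_MULTILINE and
PCRE_DOLLAR_ENDONLY, as described above, can be summarized in a few
lines of C. This is an illustrative model, not PCRE's implementation:
dollar_matches() is a hypothetical function, the two flags are plain
integers rather than option bits, and newline is assumed to be LF.

```c
#include <stddef.h>

/* Hypothetical model: return 1 if a dollar metacharacter would match at
   offset pos in subject, 0 otherwise. */
int dollar_matches(const char *subject, size_t length, size_t pos,
                   int multiline, int dollar_endonly)
{
    if (pos > length) return 0;
    if (pos == length) return 1;          /* very end of the subject */
    if (subject[pos] != '\n') return 0;   /* must be just before a newline */
    if (multiline) return 1;              /* any newline; ENDONLY ignored */
    if (dollar_endonly) return 0;         /* only the very end is allowed */
    return pos + 1 == length;             /* only a terminating newline */
}
```

For the subject "ab\ncd\n", the model accepts offset 5 by default,
rejects it when dollar_endonly is set, and accepts offset 2 only when
multiline is set.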
| 1311 |
PCRE_NEWLINE_CR |
PCRE_NEWLINE_CR |
| 1314 |
PCRE_NEWLINE_ANYCRLF |
PCRE_NEWLINE_ANYCRLF |
| 1315 |
PCRE_NEWLINE_ANY |
PCRE_NEWLINE_ANY |
| 1316 |
|
|
| 1317 |
These options override the default newline definition that was chosen |
These options override the default newline definition that was chosen |
| 1318 |
when PCRE was built. Setting the first or the second specifies that a |
when PCRE was built. Setting the first or the second specifies that a |
| 1319 |
newline is indicated by a single character (CR or LF, respectively). |
newline is indicated by a single character (CR or LF, respectively). |
| 1320 |
Setting PCRE_NEWLINE_CRLF specifies that a newline is indicated by the |
Setting PCRE_NEWLINE_CRLF specifies that a newline is indicated by the |
| 1321 |
two-character CRLF sequence. Setting PCRE_NEWLINE_ANYCRLF specifies |
two-character CRLF sequence. Setting PCRE_NEWLINE_ANYCRLF specifies |
| 1322 |
that any of the three preceding sequences should be recognized. Setting |
that any of the three preceding sequences should be recognized. Setting |
| 1323 |
PCRE_NEWLINE_ANY specifies that any Unicode newline sequence should be |
PCRE_NEWLINE_ANY specifies that any Unicode newline sequence should be |
| 1324 |
recognized. The Unicode newline sequences are the three just mentioned, |
recognized. The Unicode newline sequences are the three just mentioned, |
| 1325 |
plus the single characters VT (vertical tab, U+000B), FF (formfeed, |
plus the single characters VT (vertical tab, U+000B), FF (formfeed, |
| 1326 |
U+000C), NEL (next line, U+0085), LS (line separator, U+2028), and PS |
U+000C), NEL (next line, U+0085), LS (line separator, U+2028), and PS |
| 1327 |
(paragraph separator, U+2029). The last two are recognized only in |
(paragraph separator, U+2029). The last two are recognized only in |
| 1328 |
UTF-8 mode. |
UTF-8 mode. |
| 1329 |
|
|
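The set of sequences recognized under PCRE_NEWLINE_ANY, as listed
above, can be expressed as a function over code points. This is a
sketch under stated assumptions: any_newline_length() is a hypothetical
helper, and the input is taken to be already decoded into code points,
so no UTF-8 decoding is shown.

```c
#include <stddef.h>

/* Hypothetical helper: return the number of code points occupied by a
   newline sequence at the start of cp[0..count-1], or 0 if there is
   none. utf8 is non-zero when UTF-8 mode is in force. */
size_t any_newline_length(const unsigned long *cp, size_t count, int utf8)
{
    if (count == 0) return 0;
    if (cp[0] == 0x000D && count > 1 && cp[1] == 0x000A)
        return 2;                         /* CRLF counts as one sequence */
    switch (cp[0]) {
    case 0x000A:                          /* LF  */
    case 0x000B:                          /* VT  */
    case 0x000C:                          /* FF  */
    case 0x000D:                          /* CR  */
    case 0x0085:                          /* NEL */
        return 1;
    case 0x2028:                          /* LS: UTF-8 mode only */
    case 0x2029:                          /* PS: UTF-8 mode only */
        return utf8 ? 1 : 0;
    default:
        return 0;
    }
}
```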
| 1330 |
The newline setting in the options word uses three bits that are |
The newline setting in the options word uses three bits that are |
| 1331 |
treated as a number, giving eight possibilities. Currently only six are |
treated as a number, giving eight possibilities. Currently only six are |
| 1332 |
used (default plus the five values above). This means that if you set |
used (default plus the five values above). This means that if you set |
| 1333 |
more than one newline option, the combination may or may not be sensi- |
more than one newline option, the combination may or may not be sensi- |
| 1334 |
ble. For example, PCRE_NEWLINE_CR with PCRE_NEWLINE_LF is equivalent to |
ble. For example, PCRE_NEWLINE_CR with PCRE_NEWLINE_LF is equivalent to |
| 1335 |
PCRE_NEWLINE_CRLF, but other combinations may yield unused numbers and |
PCRE_NEWLINE_CRLF, but other combinations may yield unused numbers and |
| 1336 |
cause an error. |
cause an error. |
| 1337 |
|
|
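The three-bit newline field described above can be illustrated as
follows. The NL_xxx constants are local stand-ins whose values are
assumed to mirror the PCRE_NEWLINE_xxx definitions in pcre.h for this
release; real code should include pcre.h and use its names rather than
redefining them. newline_bits_valid() is a hypothetical helper.

```c
/* ASSUMPTION: these values mirror PCRE_NEWLINE_xxx in pcre.h. */
#define NL_CR      0x00100000UL
#define NL_LF      0x00200000UL
#define NL_CRLF    0x00300000UL
#define NL_ANY     0x00400000UL
#define NL_ANYCRLF 0x00500000UL
#define NL_MASK    0x00700000UL   /* the three-bit field */

/* Hypothetical helper: return 1 if the newline field of options holds
   zero (use the build-time default) or one of the five defined values,
   and 0 if it holds an unused number. */
int newline_bits_valid(unsigned long options)
{
    switch (options & NL_MASK) {
    case 0:
    case NL_CR:
    case NL_LF:
    case NL_CRLF:
    case NL_ANY:
    case NL_ANYCRLF:
        return 1;
    default:
        return 0;
    }
}
```

Under these assumed values, NL_CR | NL_LF equals NL_CRLF, whereas
NL_LF | NL_ANY yields an unused number. NL_CR | NL_ANY happens to equal
NL_ANYCRLF, which is valid but probably not what the caller intended.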
| 1338 |
The only time that a line break is specially recognized when compiling |
The only time that a line break is specially recognized when compiling |
| 1339 |
a pattern is if PCRE_EXTENDED is set, and an unescaped # outside a |
a pattern is if PCRE_EXTENDED is set, and an unescaped # outside a |
| 1340 |
character class is encountered. This indicates a comment that lasts |
character class is encountered. This indicates a comment that lasts |
| 1341 |
until after the next line break sequence. In other circumstances, line |
until after the next line break sequence. In other circumstances, line |
| 1342 |
break sequences are treated as literal data, except that in |
break sequences are treated as literal data, except that in |
| 1343 |
PCRE_EXTENDED mode, both CR and LF are treated as whitespace characters |
PCRE_EXTENDED mode, both CR and LF are treated as whitespace characters |
| 1344 |
and are therefore ignored. |
and are therefore ignored. |
| 1345 |
|
|
| 1346 |
The newline option that is set at compile time becomes the default that |
The newline option that is set at compile time becomes the default that |
| 1347 |
is used for pcre_exec() and pcre_dfa_exec(), but it can be overridden. |
is used for pcre_exec() and pcre_dfa_exec(), but it can be overridden. |
| 1348 |
|
|
| 1349 |
PCRE_NO_AUTO_CAPTURE |
PCRE_NO_AUTO_CAPTURE |
| 1350 |
|
|
| 1812 |
const char *subject, int length, int startoffset, |
const char *subject, int length, int startoffset, |
| 1813 |
int options, int *ovector, int ovecsize); |
int options, int *ovector, int ovecsize); |
| 1814 |
|
|
| 1815 |
The function pcre_exec() is called to match a subject string against a |
The function pcre_exec() is called to match a subject string against a |
| 1816 |
compiled pattern, which is passed in the code argument. If the pattern |
compiled pattern, which is passed in the code argument. If the pattern |
| 1817 |
has been studied, the result of the study should be passed in the extra |
has been studied, the result of the study should be passed in the extra |
| 1818 |
argument. This function is the main matching facility of the library, |
argument. This function is the main matching facility of the library, |
| 1819 |
and it operates in a Perl-like manner. For specialist use there is also |
and it operates in a Perl-like manner. For specialist use there is also |
| 1820 |
an alternative matching function, which is described below in the sec- |
an alternative matching function, which is described below in the sec- |
| 1821 |
tion about the pcre_dfa_exec() function. |
tion about the pcre_dfa_exec() function. |
| 1822 |
|
|
| 1823 |
In most applications, the pattern will have been compiled (and option- |
In most applications, the pattern will have been compiled (and option- |
| 1824 |
ally studied) in the same process that calls pcre_exec(). However, it |
ally studied) in the same process that calls pcre_exec(). However, it |
| 1825 |
is possible to save compiled patterns and study data, and then use them |
is possible to save compiled patterns and study data, and then use them |
| 1826 |
later in different processes, possibly even on different hosts. For a |
later in different processes, possibly even on different hosts. For a |
| 1827 |
discussion about this, see the pcreprecompile documentation. |
discussion about this, see the pcreprecompile documentation. |
| 1828 |
|
|
| 1829 |
Here is an example of a simple call to pcre_exec(): |
Here is an example of a simple call to pcre_exec(): |
| 1842 |
|
|
| 1843 |
Extra data for pcre_exec() |
Extra data for pcre_exec() |
| 1844 |
|
|
| 1845 |
If the extra argument is not NULL, it must point to a pcre_extra data |
If the extra argument is not NULL, it must point to a pcre_extra data |
| 1846 |
block. The pcre_study() function returns such a block (when it doesn't |
block. The pcre_study() function returns such a block (when it doesn't |
| 1847 |
return NULL), but you can also create one for yourself, and pass addi- |
return NULL), but you can also create one for yourself, and pass addi- |
| 1848 |
tional information in it. The pcre_extra block contains the following |
tional information in it. The pcre_extra block contains the following |
| 1849 |
fields (not necessarily in this order): |
fields (not necessarily in this order): |
| 1850 |
|
|
| 1851 |
unsigned long int flags; |
unsigned long int flags; |
| 1855 |
void *callout_data; |
void *callout_data; |
| 1856 |
const unsigned char *tables; |
const unsigned char *tables; |
| 1857 |
|
|
| 1858 |
The flags field is a bitmap that specifies which of the other fields |
The flags field is a bitmap that specifies which of the other fields |
| 1859 |
are set. The flag bits are: |
are set. The flag bits are: |
| 1860 |
|
|
| 1861 |
PCRE_EXTRA_STUDY_DATA |
PCRE_EXTRA_STUDY_DATA |
| 1864 |
PCRE_EXTRA_CALLOUT_DATA |
PCRE_EXTRA_CALLOUT_DATA |
| 1865 |
PCRE_EXTRA_TABLES |
PCRE_EXTRA_TABLES |
| 1866 |
|
|
| 1867 |
Other flag bits should be set to zero. The study_data field is set in |
Other flag bits should be set to zero. The study_data field is set in |
| 1868 |
the pcre_extra block that is returned by pcre_study(), together with |
the pcre_extra block that is returned by pcre_study(), together with |
| 1869 |
the appropriate flag bit. You should not set this yourself, but you may |
the appropriate flag bit. You should not set this yourself, but you may |
| 1870 |
add to the block by setting the other fields and their corresponding |
add to the block by setting the other fields and their corresponding |
| 1871 |
flag bits. |
flag bits. |
| 1872 |
|
|
| 1873 |
The match_limit field provides a means of preventing PCRE from using up |
The match_limit field provides a means of preventing PCRE from using up |
| 1874 |
a vast amount of resources when running patterns that are not going to |
a vast amount of resources when running patterns that are not going to |
| 1875 |
match, but which have a very large number of possibilities in their |
match, but which have a very large number of possibilities in their |
| 1876 |
search trees. The classic example is the use of nested unlimited |
search trees. The classic example is the use of nested unlimited |
| 1877 |
repeats. |
repeats. |
| 1878 |
|
|
| 1879 |
Internally, PCRE uses a function called match() which it calls repeat- |
Internally, PCRE uses a function called match() which it calls repeat- |
| 1880 |
edly (sometimes recursively). The limit set by match_limit is imposed |
edly (sometimes recursively). The limit set by match_limit is imposed |
| 1881 |
on the number of times this function is called during a match, which |
on the number of times this function is called during a match, which |
| 1882 |
has the effect of limiting the amount of backtracking that can take |
has the effect of limiting the amount of backtracking that can take |
| 1883 |
place. For patterns that are not anchored, the count restarts from zero |
place. For patterns that are not anchored, the count restarts from zero |
| 1884 |
for each position in the subject string. |
for each position in the subject string. |
| 1885 |
|
|
| 1886 |
The default value for the limit can be set when PCRE is built; the |
The default value for the limit can be set when PCRE is built; the |
| 1887 |
built-in default is 10 million, which handles all but the most extreme |
built-in default is 10 million, which handles all but the most extreme |
| 1888 |
cases. You can override the default by supplying pcre_exec() with a |
cases. You can override the default by supplying pcre_exec() with a |
| 1889 |
pcre_extra block in which match_limit is set, and |
pcre_extra block in which match_limit is set, and |
| 1890 |
PCRE_EXTRA_MATCH_LIMIT is set in the flags field. If the limit is |
PCRE_EXTRA_MATCH_LIMIT is set in the flags field. If the limit is |
| 1891 |
exceeded, pcre_exec() returns PCRE_ERROR_MATCHLIMIT. |
exceeded, pcre_exec() returns PCRE_ERROR_MATCHLIMIT. |
| 1892 |
|
|
| 1893 |
The match_limit_recursion field is similar to match_limit, but instead |
The match_limit_recursion field is similar to match_limit, but instead |
| 1894 |
of limiting the total number of times that match() is called, it limits |
of limiting the total number of times that match() is called, it limits |
| 1895 |
the depth of recursion. The recursion depth is a smaller number than |
the depth of recursion. The recursion depth is a smaller number than |
| 1896 |
the total number of calls, because not all calls to match() are recur- |
the total number of calls, because not all calls to match() are recur- |
| 1897 |
sive. This limit is of use only if it is set smaller than match_limit. |
sive. This limit is of use only if it is set smaller than match_limit. |
| 1898 |
|
|
| 1899 |
Limiting the recursion depth limits the amount of stack that can be |
Limiting the recursion depth limits the amount of stack that can be |
| 1925 |
|
|
| 1926 |
The unused bits of the options argument for pcre_exec() must be zero. |
The unused bits of the options argument for pcre_exec() must be zero. |
| 1927 |
The only bits that may be set are PCRE_ANCHORED, PCRE_NEWLINE_xxx, |
The only bits that may be set are PCRE_ANCHORED, PCRE_NEWLINE_xxx, |
| 1928 |
PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, PCRE_NO_UTF8_CHECK and |
PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, PCRE_NO_START_OPTIMIZE, |
| 1929 |
PCRE_PARTIAL. |
PCRE_NO_UTF8_CHECK and PCRE_PARTIAL. |
| 1930 |
|
|
| 1931 |
PCRE_ANCHORED |
PCRE_ANCHORED |
| 1932 |
|
|
| 2020 |
an ordinary match again. There is some code that demonstrates how to do |
an ordinary match again. There is some code that demonstrates how to do |
| 2021 |
this in the pcredemo.c sample program. |
this in the pcredemo.c sample program. |
| 2022 |
|
|
| 2023 |
|
PCRE_NO_START_OPTIMIZE |
| 2024 |
|
|
| 2025 |
|
There are a number of optimizations that pcre_exec() uses at the start |
| 2026 |
|
of a match, in order to speed up the process. For example, if it is |
| 2027 |
|
known that a match must start with a specific character, it searches |
| 2028 |
|
the subject for that character, and fails immediately if it cannot find |
| 2029 |
|
it, without actually running the main matching function. When callouts |
| 2030 |
|
are in use, these optimizations can cause them to be skipped. This |
| 2031 |
|
option disables the "start-up" optimizations, causing performance to |
| 2032 |
|
suffer, but ensuring that the callouts do occur. |
| 2033 |
|
|
| 2034 |
PCRE_NO_UTF8_CHECK |
PCRE_NO_UTF8_CHECK |
| 2035 |
|
|
| 2036 |
When PCRE_UTF8 is set at compile time, the validity of the subject as a |
When PCRE_UTF8 is set at compile time, the validity of the subject as a |
| 2037 |
UTF-8 string is automatically checked when pcre_exec() is subsequently |
UTF-8 string is automatically checked when pcre_exec() is subsequently |
| 2038 |
called. The value of startoffset is also checked to ensure that it |
called. The value of startoffset is also checked to ensure that it |
| 2039 |
points to the start of a UTF-8 character. There is a discussion about |
points to the start of a UTF-8 character. There is a discussion about |
| 2040 |
the validity of UTF-8 strings in the section on UTF-8 support in the |
the validity of UTF-8 strings in the section on UTF-8 support in the |
| 2041 |
main pcre page. If an invalid UTF-8 sequence of bytes is found, |
main pcre page. If an invalid UTF-8 sequence of bytes is found, |
| 2042 |
pcre_exec() returns the error PCRE_ERROR_BADUTF8. If startoffset con- |
pcre_exec() returns the error PCRE_ERROR_BADUTF8. If startoffset con- |
| 2043 |
tains an invalid value, PCRE_ERROR_BADUTF8_OFFSET is returned. |
tains an invalid value, PCRE_ERROR_BADUTF8_OFFSET is returned. |
| 2044 |
|
|
| 2045 |
If you already know that your subject is valid, and you want to skip |
If you already know that your subject is valid, and you want to skip |
| 2046 |
these checks for performance reasons, you can set the |
these checks for performance reasons, you can set the |
| 2047 |
PCRE_NO_UTF8_CHECK option when calling pcre_exec(). You might want to |
PCRE_NO_UTF8_CHECK option when calling pcre_exec(). You might want to |
| 2048 |
do this for the second and subsequent calls to pcre_exec() if you are |
do this for the second and subsequent calls to pcre_exec() if you are |
| 2049 |
making repeated calls to find all the matches in a single subject |
making repeated calls to find all the matches in a single subject |
| 2050 |
string. However, you should be sure that the value of startoffset |
string. However, you should be sure that the value of startoffset |
| 2051 |
points to the start of a UTF-8 character. When PCRE_NO_UTF8_CHECK is |
points to the start of a UTF-8 character. When PCRE_NO_UTF8_CHECK is |
| 2052 |
set, the effect of passing an invalid UTF-8 string as a subject, or a |
set, the effect of passing an invalid UTF-8 string as a subject, or a |
| 2053 |
value of startoffset that does not point to the start of a UTF-8 char- |
value of startoffset that does not point to the start of a UTF-8 char- |
| 2054 |
acter, is undefined. Your program may crash. |
acter, is undefined. Your program may crash. |
| 2055 |
|
|
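A caller that sets PCRE_NO_UTF8_CHECK takes over responsibility for the startoffset check. The helper below is a minimal sketch of one way to do that on the caller's side; the function name is hypothetical and it is not part of the PCRE API. It tests only whether the byte at the offset is a UTF-8 continuation byte (bit pattern 10xxxxxx). It does not validate the subject as a whole, and treating an offset equal to the length as acceptable is an assumption of this sketch.

```c
/* Hypothetical helper, not a PCRE function.
   Returns 1 if `offset` does not fall in the middle of a UTF-8
   character in subject[0..length), otherwise 0. */
static int is_utf8_char_start(const char *subject, int length, int offset)
{
    if (offset < 0 || offset > length)
        return 0;                    /* outside the subject */
    if (offset == length)
        return 1;                    /* assumption: end of subject is acceptable */
    /* Continuation bytes have the top two bits set to 10. */
    return ((unsigned char)subject[offset] & 0xC0) != 0x80;
}
```

For the four-byte subject 'a', 0xC3, 0xA9, 'b', offsets 0, 1 and 3 are accepted and offset 2 is rejected, because it points at the second byte of a two-byte character.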
| 2056 |
PCRE_PARTIAL |
PCRE_PARTIAL |
| 2057 |
|
|
| 2058 |
This option turns on the partial matching feature. If the subject |
This option turns on the partial matching feature. If the subject |
| 2059 |
string fails to match the pattern, but at some point during the match- |
string fails to match the pattern, but at some point during the match- |
| 2060 |
ing process the end of the subject was reached (that is, the subject |
ing process the end of the subject was reached (that is, the subject |
| 2061 |
partially matches the pattern and the failure to match occurred only |
partially matches the pattern and the failure to match occurred only |
| 2062 |
because there were not enough subject characters), pcre_exec() returns |
because there were not enough subject characters), pcre_exec() returns |
| 2063 |
PCRE_ERROR_PARTIAL instead of PCRE_ERROR_NOMATCH. When PCRE_PARTIAL is |
PCRE_ERROR_PARTIAL instead of PCRE_ERROR_NOMATCH. When PCRE_PARTIAL is |
| 2064 |
used, there are restrictions on what may appear in the pattern. These |
used, there are restrictions on what may appear in the pattern. These |
| 2065 |
are discussed in the pcrepartial documentation. |
are discussed in the pcrepartial documentation. |
| 2066 |
|
|
| 2067 |
The string to be matched by pcre_exec() |
The string to be matched by pcre_exec() |
| 2068 |
|
|
| 2069 |
The subject string is passed to pcre_exec() as a pointer in subject, a |
The subject string is passed to pcre_exec() as a pointer in subject, a |
| 2070 |
length (in bytes) in length, and a starting byte offset in startoffset. |
length (in bytes) in length, and a starting byte offset in startoffset. |
| 2071 |
In UTF-8 mode, the byte offset must point to the start of a UTF-8 char- |
In UTF-8 mode, the byte offset must point to the start of a UTF-8 char- |
| 2072 |
acter. Unlike the pattern string, the subject may contain binary zero |
acter. Unlike the pattern string, the subject may contain binary zero |
| 2073 |
bytes. When the starting offset is zero, the search for a match starts |
bytes. When the starting offset is zero, the search for a match starts |
| 2074 |
at the beginning of the subject, and this is by far the most common |
at the beginning of the subject, and this is by far the most common |
| 2075 |
case. |
case. |
| 2076 |
|
|
| 2077 |
A non-zero starting offset is useful when searching for another match |
A non-zero starting offset is useful when searching for another match |
| 2078 |
in the same subject by calling pcre_exec() again after a previous suc- |
in the same subject by calling pcre_exec() again after a previous suc- |
| 2079 |
cess. Setting startoffset differs from just passing over a shortened |
cess. Setting startoffset differs from just passing over a shortened |
| 2080 |
string and setting PCRE_NOTBOL in the case of a pattern that begins |
string and setting PCRE_NOTBOL in the case of a pattern that begins |
| 2081 |
with any kind of lookbehind. For example, consider the pattern |
with any kind of lookbehind. For example, consider the pattern |
| 2082 |
|
|
| 2083 |
\Biss\B |
\Biss\B |
| 2084 |
|
|
| 2085 |
which finds occurrences of "iss" in the middle of words. (\B matches |
which finds occurrences of "iss" in the middle of words. (\B matches |
| 2086 |
only if the current position in the subject is not a word boundary.) |
only if the current position in the subject is not a word boundary.) |
| 2087 |
When applied to the string "Mississippi" the first call to pcre_exec() |
When applied to the string "Mississippi" the first call to pcre_exec() |
| 2088 |
finds the first occurrence. If pcre_exec() is called again with just |
finds the first occurrence. If pcre_exec() is called again with just |
| 2089 |
the remainder of the subject, namely "issippi", it does not match, |
the remainder of the subject, namely "issippi", it does not match, |
| 2090 |
because \B is always false at the start of the subject, which is deemed |
because \B is always false at the start of the subject, which is deemed |
| 2091 |
to be a word boundary. However, if pcre_exec() is passed the entire |
to be a word boundary. However, if pcre_exec() is passed the entire |
| 2092 |
string again, but with startoffset set to 4, it finds the second occur- |
string again, but with startoffset set to 4, it finds the second occur- |
| 2093 |
rence of "iss" because it is able to look behind the starting point to |
rence of "iss" because it is able to look behind the starting point to |
| 2094 |
discover that it is preceded by a letter. |
discover that it is preceded by a letter. |
| 2095 |
|
|
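The difference between a shortened subject and a non-zero startoffset can be seen in a simplified model of the \B test. The sketch below is not PCRE's implementation. It assumes that word characters are ASCII letters, digits and underscore, and both function names are hypothetical. What it shows is that the test needs the character before the current position, which is available only when the whole subject is passed.

```c
#include <ctype.h>
#include <stddef.h>

/* Simplified model of a "word" character (ASCII only). */
static int is_word_char(unsigned char c)
{
    return isalnum(c) || c == '_';
}

/* Models \B: returns 1 when `pos` is NOT a word boundary in
   subject[0..len). Positions before the start and after the end of
   the subject are treated as non-word characters. */
static int not_word_boundary(const char *subject, size_t len, size_t pos)
{
    int before = pos > 0   && is_word_char((unsigned char)subject[pos - 1]);
    int after  = pos < len && is_word_char((unsigned char)subject[pos]);
    return before == after;
}
```

With the full subject "Mississippi" and position 4, the preceding "s" is visible, so the model reports a non-boundary and \B can succeed. With the shortened subject "issippi" and position 0 there is nothing before the first character, so the position counts as a boundary and \B fails.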
| 2096 |
If a non-zero starting offset is passed when the pattern is anchored, |
If a non-zero starting offset is passed when the pattern is anchored, |
| 2097 |
one attempt to match at the given offset is made. This can only succeed |
one attempt to match at the given offset is made. This can only succeed |
| 2098 |
if the pattern does not require the match to be at the start of the |
if the pattern does not require the match to be at the start of the |
| 2099 |
subject. |
subject. |
| 2100 |
|
|
| 2101 |
How pcre_exec() returns captured substrings |
How pcre_exec() returns captured substrings |
| 2102 |
|
|
| 2103 |
In general, a pattern matches a certain portion of the subject, and in |
In general, a pattern matches a certain portion of the subject, and in |
| 2104 |
addition, further substrings from the subject may be picked out by |
addition, further substrings from the subject may be picked out by |
| 2105 |
parts of the pattern. Following the usage in Jeffrey Friedl's book, |
parts of the pattern. Following the usage in Jeffrey Friedl's book, |
| 2106 |
this is called "capturing" in what follows, and the phrase "capturing |
this is called "capturing" in what follows, and the phrase "capturing |
| 2107 |
subpattern" is used for a fragment of a pattern that picks out a sub- |
subpattern" is used for a fragment of a pattern that picks out a sub- |
| 2108 |
string. PCRE supports several other kinds of parenthesized subpattern |
string. PCRE supports several other kinds of parenthesized subpattern |
| 2109 |
that do not cause substrings to be captured. |
that do not cause substrings to be captured. |
| 2110 |
|
|
| 2111 |
Captured substrings are returned to the caller via a vector of integers |
Captured substrings are returned to the caller via a vector of integers |
| 2112 |
whose address is passed in ovector. The number of elements in the vec- |
whose address is passed in ovector. The number of elements in the vec- |
| 2113 |
tor is passed in ovecsize, which must be a non-negative number. Note: |
tor is passed in ovecsize, which must be a non-negative number. Note: |
| 2114 |
this argument is NOT the size of ovector in bytes. |
this argument is NOT the size of ovector in bytes. |
| 2115 |
|
|
| 2116 |
The first two-thirds of the vector is used to pass back captured sub- |
The first two-thirds of the vector is used to pass back captured sub- |
| 2117 |
strings, each substring using a pair of integers. The remaining third |
strings, each substring using a pair of integers. The remaining third |
| 2118 |
of the vector is used as workspace by pcre_exec() while matching cap- |
of the vector is used as workspace by pcre_exec() while matching cap- |
| 2119 |
turing subpatterns, and is not available for passing back information. |
turing subpatterns, and is not available for passing back information. |
| 2120 |
The number passed in ovecsize should always be a multiple of three. If |
The number passed in ovecsize should always be a multiple of three. If |
| 2121 |
it is not, it is rounded down. |
it is not, it is rounded down. |
| 2122 |
|
|
| 2123 |
When a match is successful, information about captured substrings is |
When a match is successful, information about captured substrings is |
| 2124 |
returned in pairs of integers, starting at the beginning of ovector, |
returned in pairs of integers, starting at the beginning of ovector, |
| 2125 |
and continuing up to two-thirds of its length at the most. The first |
and continuing up to two-thirds of its length at the most. The first |
| 2126 |
element of each pair is set to the byte offset of the first character |
element of each pair is set to the byte offset of the first character |
| 2127 |
in a substring, and the second is set to the byte offset of the first |
in a substring, and the second is set to the byte offset of the first |
| 2128 |
character after the end of a substring. Note: these values are always |
character after the end of a substring. Note: these values are always |
| 2129 |
byte offsets, even in UTF-8 mode. They are not character counts. |
byte offsets, even in UTF-8 mode. They are not character counts. |
| 2130 |
|
|
| 2131 |
The first pair of integers, ovector[0] and ovector[1], identifies the |
The first pair of integers, ovector[0] and ovector[1], identifies the |
| 2132 |
portion of the subject string matched by the entire pattern. The next |
portion of the subject string matched by the entire pattern. The next |
| 2133 |
pair is used for the first capturing subpattern, and so on. The value |
pair is used for the first capturing subpattern, and so on. The value |
| 2134 |
returned by pcre_exec() is one more than the highest numbered pair that |
returned by pcre_exec() is one more than the highest numbered pair that |
| 2135 |
has been set. For example, if two substrings have been captured, the |
has been set. For example, if two substrings have been captured, the |
| 2136 |
returned value is 3. If there are no capturing subpatterns, the return |
returned value is 3. If there are no capturing subpatterns, the return |
| 2137 |
value from a successful match is 1, indicating that just the first pair |
value from a successful match is 1, indicating that just the first pair |
| 2138 |
of offsets has been set. |
of offsets has been set. |
| 2139 |
|
|
| 2140 |
If a capturing subpattern is matched repeatedly, it is the last portion |
If a capturing subpattern is matched repeatedly, it is the last portion |
| 2141 |
of the string that it matched that is returned. |
of the string that it matched that is returned. |
| 2142 |
|
|
| 2143 |
If the vector is too small to hold all the captured substring offsets, |
If the vector is too small to hold all the captured substring offsets, |
| 2144 |
it is used as far as possible (up to two-thirds of its length), and the |
it is used as far as possible (up to two-thirds of its length), and the |
| 2145 |
function returns a value of zero. If the substring offsets are not of |
function returns a value of zero. If the substring offsets are not of |
| 2146 |
interest, pcre_exec() may be called with ovector passed as NULL and |
interest, pcre_exec() may be called with ovector passed as NULL and |
| 2147 |
ovecsize as zero. However, if the pattern contains back references and |
ovecsize as zero. However, if the pattern contains back references and |
| 2148 |
the ovector is not big enough to remember the related substrings, PCRE |
the ovector is not big enough to remember the related substrings, PCRE |
| 2149 |
has to get additional memory for use during matching. Thus it is usu- |
has to get additional memory for use during matching. Thus it is usu- |
| 2150 |
ally advisable to supply an ovector. |
ally advisable to supply an ovector. |
| 2151 |
|
|
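The return value rules above can be folded into one small helper that tells the caller how many offset pairs are safe to read. This is a sketch under the rules as stated in this section; the function name is hypothetical and it is not part of the library.

```c
/* Hypothetical helper: given the value returned by pcre_exec() and
   the ovecsize that was passed in, returns how many offset pairs at
   the start of ovector were set by the highest-numbered rule. */
static int pairs_filled(int rc, int ovecsize)
{
    if (rc < 0)
        return 0;               /* error or no match: nothing to read */
    if (rc == 0)
        return ovecsize / 3;    /* vector too small: used as far as possible */
    return rc;                  /* one more than the highest numbered pair set */
}
```

A return of 3 with an ovecsize of 30 means three pairs are set (the whole match and two captures). A return of 0 with an ovecsize of 6 means the two available pairs were filled and later captures were dropped.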
| 2152 |
The pcre_info() function can be used to find out how many capturing |
The pcre_info() function can be used to find out how many capturing |
| 2153 |
subpatterns there are in a compiled pattern. The smallest size for |
subpatterns there are in a compiled pattern. The smallest size for |
| 2154 |
ovector that will allow for n captured substrings, in addition to the |
ovector that will allow for n captured substrings, in addition to the |
| 2155 |
offsets of the substring matched by the whole pattern, is (n+1)*3. |
offsets of the substring matched by the whole pattern, is (n+1)*3. |
| 2156 |
|
|
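The sizing rule can be written out directly. The two helpers below are an illustrative sketch with hypothetical names; they restate the (n+1)*3 formula and its inverse and do not call into PCRE.

```c
/* Smallest ovecsize that holds the whole-match pair plus
   n_captures captured substrings: (n+1)*3. */
static int min_ovecsize(int n_captures)
{
    return (n_captures + 1) * 3;
}

/* Largest number of captured substrings whose offsets fit in a
   vector of the given size, or -1 if not even the whole-match pair
   fits. Integer division performs the rounding down to a multiple
   of three. */
static int max_captures_for(int ovecsize)
{
    return ovecsize / 3 - 1;
}
```

A pattern with two capturing subpatterns needs an ovector of at least 9 integers. A vector of 30 integers has room for the whole match and 9 captures, and a vector of 31 is no better because the size is rounded down.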
| 2157 |
It is possible for capturing subpattern number n+1 to match some part |
It is possible for capturing subpattern number n+1 to match some part |
| 2158 |
of the subject when subpattern n has not been used at all. For example, |
of the subject when subpattern n has not been used at all. For example, |
| 2159 |
if the string "abc" is matched against the pattern (a|(z))(bc) the |
if the string "abc" is matched against the pattern (a|(z))(bc) the |
| 2160 |
return from the function is 4, and subpatterns 1 and 3 are matched, but |
return from the function is 4, and subpatterns 1 and 3 are matched, but |
| 2161 |
2 is not. When this happens, both values in the offset pairs corre- |
2 is not. When this happens, both values in the offset pairs corre- |
| 2162 |
sponding to unused subpatterns are set to -1. |
sponding to unused subpatterns are set to -1. |
| 2163 |
|
|
| 2164 |
Offset values that correspond to unused subpatterns at the end of the |
Offset values that correspond to unused subpatterns at the end of the |
| 2165 |
expression are also set to -1. For example, if the string "abc" is |
expression are also set to -1. For example, if the string "abc" is |
| 2166 |
matched against the pattern (abc)(x(yz)?)? subpatterns 2 and 3 are not |
matched against the pattern (abc)(x(yz)?)? subpatterns 2 and 3 are not |
| 2167 |
matched. The return from the function is 2, because the highest used |
matched. The return from the function is 2, because the highest used |
| 2168 |
capturing subpattern number is 1. However, you can refer to the offsets |
capturing subpattern number is 1. However, you can refer to the offsets |
| 2169 |
for the second and third capturing subpatterns if you wish (assuming |
for the second and third capturing subpatterns if you wish (assuming |
| 2170 |
the vector is large enough, of course). |
the vector is large enough, of course). |
| 2171 |
|
|
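When reading ovector directly, each pair has to be checked for the -1 marker before it is used. The function below is a minimal sketch with a hypothetical name; it is not one of the library's convenience functions. It copies one captured substring out of the subject and reports an unset subpattern as a failure, and it assumes the caller has already checked that the pair index lies within the vector.

```c
#include <string.h>

/* Hypothetical helper: copies capture number `i` (0 is the whole
   match) from `subject` into `buf` as a zero-terminated string.
   Returns the length copied, or -1 if the subpattern is unset
   (both offsets are -1) or `buf` is too small.
   Assumption: pair `i` lies within the usable part of ovector. */
static int copy_capture(const char *subject, const int *ovector, int i,
                        char *buf, int bufsize)
{
    int start = ovector[2 * i];
    int end   = ovector[2 * i + 1];

    if (start < 0 || end < start)
        return -1;                      /* unset subpattern */
    if (end - start >= bufsize)
        return -1;                      /* no room for text plus terminator */

    memcpy(buf, subject + start, (size_t)(end - start));
    buf[end - start] = '\0';
    return end - start;
}
```

For "abc" matched against (a|(z))(bc), the offsets described above are {0,3, 0,1, -1,-1, 1,3}. Captures 0, 1 and 3 yield "abc", "a" and "bc", and capture 2 is reported as unset.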
| 2172 |
Some convenience functions are provided for extracting the captured |
Some convenience functions are provided for extracting the captured |
| 2173 |
substrings as separate strings. These are described below. |
substrings as separate strings. These are described below. |
| 2174 |
|
|
| 2175 |
Error return values from pcre_exec() |
Error return values from pcre_exec() |
| 2176 |
|
|
| 2177 |
If pcre_exec() fails, it returns a negative number. The following are |
If pcre_exec() fails, it returns a negative number. The following are |
| 2178 |
defined in the header file: |
defined in the header file: |
| 2179 |
|
|
| 2180 |
PCRE_ERROR_NOMATCH (-1) |
PCRE_ERROR_NOMATCH (-1) |
| 2183 |
|
|
| 2184 |
PCRE_ERROR_NULL (-2) |
PCRE_ERROR_NULL (-2) |
| 2185 |
|
|
| 2186 |
Either code or subject was passed as NULL, or ovector was NULL and |
Either code or subject was passed as NULL, or ovector was NULL and |
| 2187 |
ovecsize was not zero. |
ovecsize was not zero. |
| 2188 |
|
|
| 2189 |
PCRE_ERROR_BADOPTION (-3) |
PCRE_ERROR_BADOPTION (-3) |
| 2192 |
|
|
| 2193 |
PCRE_ERROR_BADMAGIC (-4) |
PCRE_ERROR_BADMAGIC (-4) |
| 2194 |
|
|
| 2195 |
PCRE stores a 4-byte "magic number" at the start of the compiled code, |
PCRE stores a 4-byte "magic number" at the start of the compiled code, |
| 2196 |
to catch the case when it is passed a junk pointer and to detect when a |
to catch the case when it is passed a junk pointer and to detect when a |
| 2197 |
pattern that was compiled in an environment of one endianness is run in |
pattern that was compiled in an environment of one endianness is run in |
| 2198 |
an environment with the other endianness. This is the error that PCRE |
an environment with the other endianness. This is the error that PCRE |
| 2199 |
gives when the magic number is not present. |
gives when the magic number is not present. |
| 2200 |
|
|
| 2201 |
PCRE_ERROR_UNKNOWN_OPCODE (-5) |
PCRE_ERROR_UNKNOWN_OPCODE (-5) |
| 2202 |
|
|
| 2203 |
While running the pattern match, an unknown item was encountered in the |
While running the pattern match, an unknown item was encountered in the |
| 2204 |
compiled pattern. This error could be caused by a bug in PCRE or by |
compiled pattern. This error could be caused by a bug in PCRE or by |
| 2205 |
overwriting of the compiled pattern. |
overwriting of the compiled pattern. |
| 2206 |
|
|
| 2207 |
PCRE_ERROR_NOMEMORY (-6) |
PCRE_ERROR_NOMEMORY (-6) |
| 2208 |
|
|
| 2209 |
If a pattern contains back references, but the ovector that is passed |
If a pattern contains back references, but the ovector that is passed |
| 2210 |
to pcre_exec() is not big enough to remember the referenced substrings, |
to pcre_exec() is not big enough to remember the referenced substrings, |
| 2211 |
PCRE gets a block of memory at the start of matching to use for this |
PCRE gets a block of memory at the start of matching to use for this |
| 2212 |
purpose. If the call via pcre_malloc() fails, this error is given. The |
purpose. If the call via pcre_malloc() fails, this error is given. The |
| 2213 |
memory is automatically freed at the end of matching. |
memory is automatically freed at the end of matching. |
| 2214 |
|
|
| 2215 |
PCRE_ERROR_NOSUBSTRING (-7) |
PCRE_ERROR_NOSUBSTRING (-7) |
| 2216 |
|
|
| 2217 |
This error is used by the pcre_copy_substring(), pcre_get_substring(), |
This error is used by the pcre_copy_substring(), pcre_get_substring(), |
| 2218 |
and pcre_get_substring_list() functions (see below). It is never |
and pcre_get_substring_list() functions (see below). It is never |
| 2219 |
returned by pcre_exec(). |
returned by pcre_exec(). |
| 2220 |
|
|
| 2221 |
PCRE_ERROR_MATCHLIMIT (-8) |
PCRE_ERROR_MATCHLIMIT (-8) |
| 2222 |
|
|
| 2223 |
The backtracking limit, as specified by the match_limit field in a |
The backtracking limit, as specified by the match_limit field in a |
| 2224 |
pcre_extra structure (or defaulted), was reached. See the description |
pcre_extra structure (or defaulted), was reached. See the description |
| 2225 |
above. |
above. |
| 2226 |
|
|
| 2227 |
PCRE_ERROR_CALLOUT (-9) |
PCRE_ERROR_CALLOUT (-9) |
| 2228 |
|
|
| 2229 |
This error is never generated by pcre_exec() itself. It is provided for |
This error is never generated by pcre_exec() itself. It is provided for |
| 2230 |
use by callout functions that want to yield a distinctive error code. |
use by callout functions that want to yield a distinctive error code. |
| 2231 |
See the pcrecallout documentation for details. |
See the pcrecallout documentation for details. |
| 2232 |
|
|
| 2233 |
PCRE_ERROR_BADUTF8 (-10) |
PCRE_ERROR_BADUTF8 (-10) |
| 2234 |
|
|
| 2235 |
A string that contains an invalid UTF-8 byte sequence was passed as a |
A string that contains an invalid UTF-8 byte sequence was passed as a |
| 2236 |
subject. |
subject. |
| 2237 |
|
|
| 2238 |
PCRE_ERROR_BADUTF8_OFFSET (-11) |
PCRE_ERROR_BADUTF8_OFFSET (-11) |
| 2239 |
|
|
| 2240 |
The UTF-8 byte sequence that was passed as a subject was valid, but the |
The UTF-8 byte sequence that was passed as a subject was valid, but the |
| 2241 |
value of startoffset did not point to the beginning of a UTF-8 charac- |
value of startoffset did not point to the beginning of a UTF-8 charac- |
| 2242 |
ter. |
ter. |
| 2243 |
|
|
| 2244 |
PCRE_ERROR_PARTIAL (-12) |
PCRE_ERROR_PARTIAL (-12) |
| 2245 |
|
|
| 2246 |
The subject string did not match, but it did match partially. See the |
The subject string did not match, but it did match partially. See the |
| 2247 |
pcrepartial documentation for details of partial matching. |
pcrepartial documentation for details of partial matching. |
| 2248 |
|
|
| 2249 |
PCRE_ERROR_BADPARTIAL (-13) |
PCRE_ERROR_BADPARTIAL (-13) |
| 2250 |
|
|
| 2251 |
The PCRE_PARTIAL option was used with a compiled pattern containing |
The PCRE_PARTIAL option was used with a compiled pattern containing |
| 2252 |
items that are not supported for partial matching. See the pcrepartial |
items that are not supported for partial matching. See the pcrepartial |
| 2253 |
documentation for details of partial matching. |
documentation for details of partial matching. |
| 2254 |
|
|
| 2255 |
PCRE_ERROR_INTERNAL (-14) |
PCRE_ERROR_INTERNAL (-14) |
| 2256 |
|
|
| 2257 |
An unexpected internal error has occurred. This error could be caused |
An unexpected internal error has occurred. This error could be caused |
| 2258 |
by a bug in PCRE or by overwriting of the compiled pattern. |
by a bug in PCRE or by overwriting of the compiled pattern. |
| 2259 |
|
|
| 2260 |
PCRE_ERROR_BADCOUNT (-15) |
PCRE_ERROR_BADCOUNT (-15) |
| 2261 |
|
|
| 2262 |
This error is given if the value of the ovecsize argument is negative. |
This error is given if the value of the ovecsize argument is negative. |
| 2263 |
|
|
| 2264 |
PCRE_ERROR_RECURSIONLIMIT (-21) |
PCRE_ERROR_RECURSIONLIMIT (-21) |
| 2265 |
|
|
| 2409 |
ate. NOTE: If PCRE_DUPNAMES is set and there are duplicate names, the |
ate. NOTE: If PCRE_DUPNAMES is set and there are duplicate names, the |
| 2410 |
behaviour may not be what you want (see the next section). |
behaviour may not be what you want (see the next section). |
| 2411 |
|
|
| 2412 |
|
Warning: If the pattern uses the "(?|" feature to set up multiple sub- |
| 2413 |
|
patterns with the same number, you cannot use names to distinguish |
| 2414 |
|
them, because names are not included in the compiled code. The matching |
| 2415 |
|
process uses only numbers. |
| 2416 |
|
|
| 2417 |
|
|
| 2418 |
DUPLICATE SUBPATTERN NAMES |
DUPLICATE SUBPATTERN NAMES |
| 2419 |
|
|
| 2420 |
int pcre_get_stringtable_entries(const pcre *code, |
int pcre_get_stringtable_entries(const pcre *code, |
| 2421 |
const char *name, char **first, char **last); |
const char *name, char **first, char **last); |
| 2422 |
|
|
| 2423 |
When a pattern is compiled with the PCRE_DUPNAMES option, names for |
When a pattern is compiled with the PCRE_DUPNAMES option, names for |
| 2424 |
subpatterns are not required to be unique. Normally, patterns with |
subpatterns are not required to be unique. Normally, patterns with |
| 2425 |
duplicate names are such that in any one match, only one of the named |
duplicate names are such that in any one match, only one of the named |
| 2426 |
subpatterns participates. An example is shown in the pcrepattern docu- |
subpatterns participates. An example is shown in the pcrepattern docu- |
| 2427 |
mentation. |
mentation. |
| 2428 |
|
|
| 2429 |
When duplicates are present, pcre_copy_named_substring() and |
When duplicates are present, pcre_copy_named_substring() and |
| 2430 |
pcre_get_named_substring() return the first substring corresponding to |
pcre_get_named_substring() return the first substring corresponding to |
| 2431 |
the given name that is set. If none are set, PCRE_ERROR_NOSUBSTRING |
the given name that is set. If none are set, PCRE_ERROR_NOSUBSTRING |
| 2432 |
(-7) is returned; no data is returned. The pcre_get_stringnumber() |
(-7) is returned; no data is returned. The pcre_get_stringnumber() |
| 2433 |
function returns one of the numbers that are associated with the name, |
function returns one of the numbers that are associated with the name, |
| 2434 |
but it is not defined which it is. |
but it is not defined which it is. |
| 2435 |
|
|
| 2436 |
If you want to get full details of all captured substrings for a given |
If you want to get full details of all captured substrings for a given |
| 2437 |
name, you must use the pcre_get_stringtable_entries() function. The |
name, you must use the pcre_get_stringtable_entries() function. The |
| 2438 |
first argument is the compiled pattern, and the second is the name. The |
first argument is the compiled pattern, and the second is the name. The |
| 2439 |
third and fourth are pointers to variables which are updated by the |
third and fourth are pointers to variables which are updated by the |
| 2440 |
function. After it has run, they point to the first and last entries in |
function. After it has run, they point to the first and last entries in |
| 2441 |
the name-to-number table for the given name. The function itself |
the name-to-number table for the given name. The function itself |
| 2442 |
returns the length of each entry, or PCRE_ERROR_NOSUBSTRING (-7) if |
returns the length of each entry, or PCRE_ERROR_NOSUBSTRING (-7) if |
| 2443 |
there are none. The format of the table is described above in the sec- |
there are none. The format of the table is described above in the sec- |
| 2444 |
tion entitled Information about a pattern. Given all the relevant |
tion entitled Information about a pattern. Given all the relevant |
| 2445 |
entries for the name, you can extract each of their numbers, and hence |
entries for the name, you can extract each of their numbers, and hence |
| 2446 |
the captured data, if any. |
the captured data, if any. |
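
The walk over the entries between first and last might be sketched as follows. This is only an illustration: it assumes the 8-bit library's name table layout, in which each entry starts with the subpattern number in two bytes (most significant byte first) followed by the zero-terminated name. The helper name collect_group_numbers is hypothetical and is not part of PCRE.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not a PCRE function): walk the name-table entries
   from `first` to `last` inclusive, as set by
   pcre_get_stringtable_entries(), and store each group number in `out`.
   `entry_size` is that function's return value. Returns how many numbers
   were stored (at most `max_out`). */
static int collect_group_numbers(const char *first, const char *last,
                                 int entry_size, int *out, int max_out)
{
    int n = 0;
    const unsigned char *p;

    assert(entry_size > 2 && first != NULL && last != NULL && out != NULL);

    for (p = (const unsigned char *)first;
         p <= (const unsigned char *)last && n < max_out;
         p += entry_size) {
        /* Assumed layout: two-byte number, most significant byte first;
           the zero-terminated name follows at p + 2. */
        out[n++] = (p[0] << 8) | p[1];
    }
    return n;
}
```

For each number n collected this way, the captured data (if that group was set) lies between ovector[2*n] and ovector[2*n+1] after a successful pcre_exec() call.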
| 2447 |
|
|
| 2448 |
|
|
| 2449 |
FINDING ALL POSSIBLE MATCHES |
FINDING ALL POSSIBLE MATCHES |
| 2450 |
|
|
| 2451 |
The traditional matching function uses a similar algorithm to Perl, |
The traditional matching function uses a similar algorithm to Perl, |
| 2452 |
which stops when it finds the first match, starting at a given point in |
which stops when it finds the first match, starting at a given point in |
| 2453 |
the subject. If you want to find all possible matches, or the longest |
the subject. If you want to find all possible matches, or the longest |
| 2454 |
possible match, consider using the alternative matching function (see |
possible match, consider using the alternative matching function (see |
| 2455 |
below) instead. If you cannot use the alternative function, but still |
below) instead. If you cannot use the alternative function, but still |
| 2456 |
need to find all possible matches, you can kludge it up by making use |
need to find all possible matches, you can kludge it up by making use |
| 2457 |
of the callout facility, which is described in the pcrecallout documen- |
of the callout facility, which is described in the pcrecallout documen- |
| 2458 |
tation. |
tation. |
| 2459 |
|
|
| 2460 |
What you have to do is to insert a callout right at the end of the pat- |
What you have to do is to insert a callout right at the end of the pat- |
| 2461 |
tern. When your callout function is called, extract and save the cur- |
tern. When your callout function is called, extract and save the cur- |
| 2462 |
rent matched substring. Then return 1, which forces pcre_exec() to |
rent matched substring. Then return 1, which forces pcre_exec() to |
| 2463 |
backtrack and try other alternatives. Ultimately, when it runs out of |
backtrack and try other alternatives. Ultimately, when it runs out of |
| 2464 |
matches, pcre_exec() will yield PCRE_ERROR_NOMATCH. |
matches, pcre_exec() will yield PCRE_ERROR_NOMATCH. |
| 2465 |
|
|
| 2466 |
|
|
| 2471 |
int options, int *ovector, int ovecsize, |
int options, int *ovector, int ovecsize, |
| 2472 |
int *workspace, int wscount); |
int *workspace, int wscount); |
| 2473 |
|
|
| 2474 |
The function pcre_dfa_exec() is called to match a subject string |
The function pcre_dfa_exec() is called to match a subject string |
| 2475 |
against a compiled pattern, using a matching algorithm that scans the |
against a compiled pattern, using a matching algorithm that scans the |
| 2476 |
subject string just once, and does not backtrack. This has different |
subject string just once, and does not backtrack. This has different |
| 2477 |
characteristics to the normal algorithm, and is not compatible with |
characteristics to the normal algorithm, and is not compatible with |
| 2478 |
Perl. Some of the features of PCRE patterns are not supported. Never- |
Perl. Some of the features of PCRE patterns are not supported. Never- |
| 2479 |
theless, there are times when this kind of matching can be useful. For |
theless, there are times when this kind of matching can be useful. For |
| 2480 |
a discussion of the two matching algorithms, see the pcrematching docu- |
a discussion of the two matching algorithms, see the pcrematching docu- |
| 2481 |
mentation. |
mentation. |
| 2482 |
|
|
| 2483 |
The arguments for the pcre_dfa_exec() function are the same as for |
The arguments for the pcre_dfa_exec() function are the same as for |
| 2484 |
pcre_exec(), plus two extras. The ovector argument is used in a differ- |
pcre_exec(), plus two extras. The ovector argument is used in a differ- |
| 2485 |
ent way, and this is described below. The other common arguments are |
ent way, and this is described below. The other common arguments are |
| 2486 |
used in the same way as for pcre_exec(), so their description is not |
used in the same way as for pcre_exec(), so their description is not |
| 2487 |
repeated here. |
repeated here. |
| 2488 |
|
|
| 2489 |
The two additional arguments provide workspace for the function. The |
The two additional arguments provide workspace for the function. The |
| 2490 |
workspace vector should contain at least 20 elements. It is used for |
workspace vector should contain at least 20 elements. It is used for |
| 2491 |
keeping track of multiple paths through the pattern tree. More |
keeping track of multiple paths through the pattern tree. More |
| 2492 |
workspace will be needed for patterns and subjects where there are a |
workspace will be needed for patterns and subjects where there are a |
| 2493 |
lot of potential matches. |
lot of potential matches. |
| 2494 |
|
|
| 2495 |
Here is an example of a simple call to pcre_dfa_exec(): |
Here is an example of a simple call to pcre_dfa_exec(): |
| 2511 |
|
|
| 2512 |
Option bits for pcre_dfa_exec() |
Option bits for pcre_dfa_exec() |
| 2513 |
|
|
| 2514 |
The unused bits of the options argument for pcre_dfa_exec() must be |
The unused bits of the options argument for pcre_dfa_exec() must be |
| 2515 |
zero. The only bits that may be set are PCRE_ANCHORED, PCRE_NEW- |
zero. The only bits that may be set are PCRE_ANCHORED, PCRE_NEW- |
| 2516 |
LINE_xxx, PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, PCRE_NO_UTF8_CHECK, |
LINE_xxx, PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, PCRE_NO_UTF8_CHECK, |
| 2517 |
PCRE_PARTIAL, PCRE_DFA_SHORTEST, and PCRE_DFA_RESTART. All but the last |
PCRE_PARTIAL, PCRE_DFA_SHORTEST, and PCRE_DFA_RESTART. All but the last |
| 2518 |
three of these are the same as for pcre_exec(), so their description is |
three of these are the same as for pcre_exec(), so their description is |
| 2519 |
not repeated here. |
not repeated here. |
| 2520 |
|
|
| 2521 |
PCRE_PARTIAL |
PCRE_PARTIAL |
| 2522 |
|
|
| 2523 |
This has the same general effect as it does for pcre_exec(), but the |
This has the same general effect as it does for pcre_exec(), but the |
| 2524 |
details are slightly different. When PCRE_PARTIAL is set for |
details are slightly different. When PCRE_PARTIAL is set for |
| 2525 |
pcre_dfa_exec(), the return code PCRE_ERROR_NOMATCH is converted into |
pcre_dfa_exec(), the return code PCRE_ERROR_NOMATCH is converted into |
| 2526 |
PCRE_ERROR_PARTIAL if the end of the subject is reached, there have |
PCRE_ERROR_PARTIAL if the end of the subject is reached, there have |
| 2527 |
been no complete matches, but there is still at least one matching pos- |
been no complete matches, but there is still at least one matching pos- |
| 2528 |
sibility. The portion of the string that provided the partial match is |
sibility. The portion of the string that provided the partial match is |
| 2529 |
set as the first matching string. |
set as the first matching string. |
| 2530 |
|
|
| 2531 |
PCRE_DFA_SHORTEST |
PCRE_DFA_SHORTEST |
| 2532 |
|
|
| 2533 |
Setting the PCRE_DFA_SHORTEST option causes the matching algorithm to |
Setting the PCRE_DFA_SHORTEST option causes the matching algorithm to |
| 2534 |
stop as soon as it has found one match. Because of the way the alterna- |
stop as soon as it has found one match. Because of the way the alterna- |
| 2535 |
tive algorithm works, this is necessarily the shortest possible match |
tive algorithm works, this is necessarily the shortest possible match |
| 2536 |
at the first possible matching point in the subject string. |
at the first possible matching point in the subject string. |
| 2537 |
|
|
| 2538 |
PCRE_DFA_RESTART |
PCRE_DFA_RESTART |
| 2539 |
|
|
| 2540 |
When pcre_dfa_exec() is called with the PCRE_PARTIAL option, and |
When pcre_dfa_exec() is called with the PCRE_PARTIAL option, and |
| 2541 |
returns a partial match, it is possible to call it again, with addi- |
returns a partial match, it is possible to call it again, with addi- |
| 2542 |
tional subject characters, and have it continue with the same match. |
tional subject characters, and have it continue with the same match. |
| 2543 |
The PCRE_DFA_RESTART option requests this action; when it is set, the |
The PCRE_DFA_RESTART option requests this action; when it is set, the |
| 2544 |
workspace and wscount arguments must reference the same vector as before |
workspace and wscount arguments must reference the same vector as before |
| 2545 |
because data about the match so far is left in them after a partial |
because data about the match so far is left in them after a partial |
| 2546 |
match. There is more discussion of this facility in the pcrepartial |
match. There is more discussion of this facility in the pcrepartial |
| 2547 |
documentation. |
documentation. |
| 2548 |
|
|
| 2549 |
Successful returns from pcre_dfa_exec() |
Successful returns from pcre_dfa_exec() |
| 2550 |
|
|
| 2551 |
When pcre_dfa_exec() succeeds, it may have matched more than one sub- |
When pcre_dfa_exec() succeeds, it may have matched more than one sub- |
| 2552 |
string in the subject. Note, however, that all the matches from one run |
string in the subject. Note, however, that all the matches from one run |
| 2553 |
of the function start at the same point in the subject. The shorter |
of the function start at the same point in the subject. The shorter |
| 2554 |
matches are all initial substrings of the longer matches. For example, |
matches are all initial substrings of the longer matches. For example, |
| 2555 |
if the pattern |
if the pattern |
| 2556 |
|
|
| 2557 |
<.*> |
<.*> |
| 2566 |
<something> <something else> |
<something> <something else> |
| 2567 |
<something> <something else> <something further> |
<something> <something else> <something further> |
| 2568 |
|
|
| 2569 |
On success, the yield of the function is a number greater than zero, |
On success, the yield of the function is a number greater than zero, |
| 2570 |
which is the number of matched substrings. The substrings themselves |
which is the number of matched substrings. The substrings themselves |
| 2571 |
are returned in ovector. Each string uses two elements; the first is |
are returned in ovector. Each string uses two elements; the first is |
| 2572 |
the offset to the start, and the second is the offset to the end. In |
the offset to the start, and the second is the offset to the end. In |
| 2573 |
fact, all the strings have the same start offset. (Space could have |
fact, all the strings have the same start offset. (Space could have |
| 2574 |
been saved by giving this only once, but it was decided to retain some |
been saved by giving this only once, but it was decided to retain some |
| 2575 |
compatibility with the way pcre_exec() returns data, even though the |
compatibility with the way pcre_exec() returns data, even though the |
| 2576 |
meaning of the strings is different.) |
meaning of the strings is different.) |
| 2577 |
|
|
| 2578 |
The strings are returned in reverse order of length; that is, the long- |
The strings are returned in reverse order of length; that is, the long- |
| 2579 |
est matching string is given first. If there were too many matches to |
est matching string is given first. If there were too many matches to |
| 2580 |
fit into ovector, the yield of the function is zero, and the vector is |
fit into ovector, the yield of the function is zero, and the vector is |
| 2581 |
filled with the longest matches. |
filled with the longest matches. |
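
A minimal sketch of reading the results back out of ovector is given here. The helper dfa_match_copy is hypothetical, not a PCRE function, and it assumes that a yield of zero means ovecsize/2 pairs were filled in; the text above says only that the vector then holds the longest matches.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper (not a PCRE function): copy the i-th match reported
   by pcre_dfa_exec() into buf. `rc` is the function's return value and
   `ovecsize` the size that was passed to it. Returns the match length, or
   -1 if there is no such match or buf is too small. */
static int dfa_match_copy(const char *subject, const int *ovector,
                          int ovecsize, int rc, int i,
                          char *buf, int bufsize)
{
    /* Assumption: a yield of zero means ovector overflowed and holds
       ovecsize/2 pairs, longest match first. */
    int count = (rc == 0) ? ovecsize / 2 : rc;
    int start, len;

    assert(subject != NULL && ovector != NULL && buf != NULL);
    if (rc < 0 || i < 0 || i >= count) return -1;

    start = ovector[2 * i];            /* same start offset for every match */
    len = ovector[2 * i + 1] - start;  /* end offset minus start offset */
    if (len < 0 || len >= bufsize) return -1;

    memcpy(buf, subject + start, (size_t)len);
    buf[len] = '\0';
    return len;
}
```

Index 0 is the longest match, and higher indexes give progressively shorter initial substrings of it.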
| 2582 |
|
|
| 2583 |
Error returns from pcre_dfa_exec() |
Error returns from pcre_dfa_exec() |
| 2584 |
|
|
| 2585 |
The pcre_dfa_exec() function returns a negative number when it fails. |
The pcre_dfa_exec() function returns a negative number when it fails. |
| 2586 |
Many of the errors are the same as for pcre_exec(), and these are |
Many of the errors are the same as for pcre_exec(), and these are |
| 2587 |
described above. There are in addition the following errors that are |
described above. There are in addition the following errors that are |
| 2588 |
specific to pcre_dfa_exec(): |
specific to pcre_dfa_exec(): |
| 2589 |
|
|
| 2590 |
PCRE_ERROR_DFA_UITEM (-16) |
PCRE_ERROR_DFA_UITEM (-16) |
| 2591 |
|
|
| 2592 |
This return is given if pcre_dfa_exec() encounters an item in the pat- |
This return is given if pcre_dfa_exec() encounters an item in the pat- |
| 2593 |
tern that it does not support, for instance, the use of \C or a back |
tern that it does not support, for instance, the use of \C or a back |
| 2594 |
reference. |
reference. |
| 2595 |
|
|
| 2596 |
PCRE_ERROR_DFA_UCOND (-17) |
PCRE_ERROR_DFA_UCOND (-17) |
| 2597 |
|
|
| 2598 |
This return is given if pcre_dfa_exec() encounters a condition item |
This return is given if pcre_dfa_exec() encounters a condition item |
| 2599 |
that uses a back reference for the condition, or a test for recursion |
that uses a back reference for the condition, or a test for recursion |
| 2600 |
in a specific group. These are not supported. |
in a specific group. These are not supported. |
| 2601 |
|
|
| 2602 |
PCRE_ERROR_DFA_UMLIMIT (-18) |
PCRE_ERROR_DFA_UMLIMIT (-18) |
| 2603 |
|
|
| 2604 |
This return is given if pcre_dfa_exec() is called with an extra block |
This return is given if pcre_dfa_exec() is called with an extra block |
| 2605 |
that contains a setting of the match_limit field. This is not supported |
that contains a setting of the match_limit field. This is not supported |
| 2606 |
(it is meaningless). |
(it is meaningless). |
| 2607 |
|
|
| 2608 |
PCRE_ERROR_DFA_WSSIZE (-19) |
PCRE_ERROR_DFA_WSSIZE (-19) |
| 2609 |
|
|
| 2610 |
This return is given if pcre_dfa_exec() runs out of space in the |
This return is given if pcre_dfa_exec() runs out of space in the |
| 2611 |
workspace vector. |
workspace vector. |
| 2612 |
|
|
| 2613 |
PCRE_ERROR_DFA_RECURSE (-20) |
PCRE_ERROR_DFA_RECURSE (-20) |
| 2614 |
|
|
| 2615 |
When a recursive subpattern is processed, the matching function calls |
When a recursive subpattern is processed, the matching function calls |
| 2616 |
itself recursively, using private vectors for ovector and workspace. |
itself recursively, using private vectors for ovector and workspace. |
| 2617 |
This error is given if the output vector is not large enough. This |
This error is given if the output vector is not large enough. This |
| 2618 |
should be extremely rare, as a vector of size 1000 is used. |
should be extremely rare, as a vector of size 1000 is used. |
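
As an illustration, the five codes specific to pcre_dfa_exec() could be turned into diagnostic text along these lines. The function name dfa_error_text and the message wording are made up for this sketch; the numeric values are the ones listed above, and real code would normally use the PCRE_ERROR_DFA_xxx names from pcre.h rather than literals.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not a PCRE function): map the error codes that are
   specific to pcre_dfa_exec() to short descriptions. Returns NULL for any
   other value, including errors shared with pcre_exec(). */
static const char *dfa_error_text(int rc)
{
    assert(rc < 0);   /* only negative yields are errors */

    switch (rc) {
    case -16: return "unsupported item in pattern (DFA_UITEM)";
    case -17: return "unsupported condition in pattern (DFA_UCOND)";
    case -18: return "match_limit set in extra block (DFA_UMLIMIT)";
    case -19: return "workspace vector too small (DFA_WSSIZE)";
    case -20: return "recursion output vector too small (DFA_RECURSE)";
    default:  return NULL;
    }
}
```

Of these, only the workspace error is normally worth retrying, with a larger workspace vector; the others indicate that the pattern or the extra block needs to change.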
| 2619 |
|
|
| 2620 |
|
|
| 2621 |
SEE ALSO |
SEE ALSO |
| 2622 |
|
|
| 2623 |
pcrebuild(3), pcrecallout(3), pcrecpp(3), pcrematching(3), pcrepar- |
pcrebuild(3), pcrecallout(3), pcrecpp(3), pcrematching(3), pcrepar- |
| 2624 |
tial(3), pcreposix(3), pcreprecompile(3), pcresample(3), pcrestack(3). |
tial(3), pcreposix(3), pcreprecompile(3), pcresample(3), pcrestack(3). |
| 2625 |
|
|
| 2626 |
|
|
| 2627 |
AUTHOR |
AUTHOR |
| 2633 |
|
|
| 2634 |
REVISION |
REVISION |
| 2635 |
|
|
| 2636 |
Last updated: 24 August 2008 |
Last updated: 17 March 2009 |
| 2637 |
Copyright (c) 1997-2008 University of Cambridge. |
Copyright (c) 1997-2009 University of Cambridge. |
| 2638 |
------------------------------------------------------------------------------ |
------------------------------------------------------------------------------ |
| 2639 |
|
|
| 2640 |
|
|
| 2685 |
MISSING CALLOUTS |
MISSING CALLOUTS |
| 2686 |
|
|
| 2687 |
You should be aware that, because of optimizations in the way PCRE |
You should be aware that, because of optimizations in the way PCRE |
| 2688 |
matches patterns, callouts sometimes do not happen. For example, if the |
matches patterns by default, callouts sometimes do not happen. For |
| 2689 |
pattern is |
example, if the pattern is |
| 2690 |
|
|
| 2691 |
ab(?C4)cd |
ab(?C4)cd |
| 2692 |
|
|
| 2695 |
ever start, and the callout is never reached. However, with "abyd", |
ever start, and the callout is never reached. However, with "abyd", |
| 2696 |
though the result is still no match, the callout is obeyed. |
though the result is still no match, the callout is obeyed. |
| 2697 |
|
|
| 2698 |
|
You can disable these optimizations by passing the PCRE_NO_START_OPTI- |
| 2699 |
|
MIZE option to pcre_exec() or pcre_dfa_exec(). This slows down the |
| 2700 |
|
matching process, but does ensure that callouts such as the example |
| 2701 |
|
above are obeyed. |
| 2702 |
|
|
| 2703 |
|
|
| 2704 |
THE CALLOUT INTERFACE |
THE CALLOUT INTERFACE |
| 2705 |
|
|
| 2706 |
During matching, when PCRE reaches a callout point, the external func- |
During matching, when PCRE reaches a callout point, the external func- |
| 2707 |
tion defined by pcre_callout is called (if it is set). This applies to |
tion defined by pcre_callout is called (if it is set). This applies to |
| 2708 |
both the pcre_exec() and the pcre_dfa_exec() matching functions. The |
both the pcre_exec() and the pcre_dfa_exec() matching functions. The |
| 2709 |
only argument to the callout function is a pointer to a pcre_callout |
only argument to the callout function is a pointer to a pcre_callout |
| 2710 |
block. This structure contains the following fields: |
block. This structure contains the following fields: |
| 2711 |
|
|
| 2712 |
int version; |
int version; |
| 2722 |
int pattern_position; |
int pattern_position; |
| 2723 |
int next_item_length; |
int next_item_length; |
| 2724 |
|
|
| 2725 |
The version field is an integer containing the version number of the |
The version field is an integer containing the version number of the |
| 2726 |
block format. The initial version was 0; the current version is 1. The |
block format. The initial version was 0; the current version is 1. The |
| 2727 |
version number will change again in future if additional fields are |
version number will change again in future if additional fields are |
| 2728 |
added, but the intention is never to remove any of the existing fields. |
added, but the intention is never to remove any of the existing fields. |
| 2729 |
|
|
| 2730 |
The callout_number field contains the number of the callout, as com- |
The callout_number field contains the number of the callout, as com- |
| 2809 |
|
|
| 2810 |
REVISION |
REVISION |
| 2811 |
|
|
| 2812 |
Last updated: 29 May 2007 |
Last updated: 15 March 2009 |
| 2813 |
Copyright (c) 1997-2007 University of Cambridge. |
Copyright (c) 1997-2009 University of Cambridge. |
| 2814 |
------------------------------------------------------------------------------ |
------------------------------------------------------------------------------ |
| 2815 |
|
|
| 2816 |
|
|
| 3089 |
syntax) |
syntax) |
| 3090 |
] terminates the character class |
] terminates the character class |
| 3091 |
|
|
| 3092 |
The following sections describe the use of each of the metacharacters. |
The following sections describe the use of each of the metacharacters. |
| 3093 |
|
|
| 3094 |
|
|
| 3095 |
BACKSLASH |
BACKSLASH |
| 3096 |
|
|
| 3097 |
The backslash character has several uses. Firstly, if it is followed by |
The backslash character has several uses. Firstly, if it is followed by |
| 3098 |
a non-alphanumeric character, it takes away any special meaning that |
a non-alphanumeric character, it takes away any special meaning that |
| 3099 |
character may have. This use of backslash as an escape character |
character may have. This use of backslash as an escape character |
| 3100 |
applies both inside and outside character classes. |
applies both inside and outside character classes. |
| 3101 |
|
|
| 3102 |
For example, if you want to match a * character, you write \* in the |
For example, if you want to match a * character, you write \* in the |
| 3103 |
pattern. This escaping action applies whether or not the following |
pattern. This escaping action applies whether or not the following |
| 3104 |
character would otherwise be interpreted as a metacharacter, so it is |
character would otherwise be interpreted as a metacharacter, so it is |
| 3105 |
always safe to precede a non-alphanumeric with backslash to specify |
always safe to precede a non-alphanumeric with backslash to specify |
| 3106 |
that it stands for itself. In particular, if you want to match a back- |
that it stands for itself. In particular, if you want to match a back- |
| 3107 |
slash, you write \\. |
slash, you write \\. |
| 3108 |
|
|
| 3109 |
If a pattern is compiled with the PCRE_EXTENDED option, whitespace in |
If a pattern is compiled with the PCRE_EXTENDED option, whitespace in |
| 3110 |
the pattern (other than in a character class) and characters between a |
the pattern (other than in a character class) and characters between a |
| 3111 |
# outside a character class and the next newline are ignored. An escap- |
# outside a character class and the next newline are ignored. An escap- |
| 3112 |
ing backslash can be used to include a whitespace or # character as |
ing backslash can be used to include a whitespace or # character as |
| 3113 |
part of the pattern. |
part of the pattern. |
| 3114 |
|
|
| 3115 |
If you want to remove the special meaning from a sequence of charac- |
If you want to remove the special meaning from a sequence of charac- |
| 3116 |
ters, you can do so by putting them between \Q and \E. This is differ- |
ters, you can do so by putting them between \Q and \E. This is differ- |
| 3117 |
ent from Perl in that $ and @ are handled as literals in \Q...\E |
ent from Perl in that $ and @ are handled as literals in \Q...\E |
| 3118 |
sequences in PCRE, whereas in Perl, $ and @ cause variable interpola- |
sequences in PCRE, whereas in Perl, $ and @ cause variable interpola- |
| 3119 |
tion. Note the following examples: |
tion. Note the following examples: |
| 3120 |
|
|
| 3121 |
Pattern PCRE matches Perl matches |
Pattern PCRE matches Perl matches |
| 3125 |
\Qabc\$xyz\E abc\$xyz abc\$xyz |
\Qabc\$xyz\E abc\$xyz abc\$xyz |
| 3126 |
\Qabc\E\$\Qxyz\E abc$xyz abc$xyz |
\Qabc\E\$\Qxyz\E abc$xyz abc$xyz |
| 3127 |
|
|
| 3128 |
The \Q...\E sequence is recognized both inside and outside character |
The \Q...\E sequence is recognized both inside and outside character |
| 3129 |
classes. |
classes. |
| 3130 |
|
|
| 3131 |
Non-printing characters |
Non-printing characters |
| 3132 |
|
|
| 3133 |
A second use of backslash provides a way of encoding non-printing char- |
A second use of backslash provides a way of encoding non-printing char- |
| 3134 |
acters in patterns in a visible manner. There is no restriction on the |
acters in patterns in a visible manner. There is no restriction on the |
| 3135 |
appearance of non-printing characters, apart from the binary zero that |
appearance of non-printing characters, apart from the binary zero that |
| 3136 |
terminates a pattern, but when a pattern is being prepared by text |
terminates a pattern, but when a pattern is being prepared by text |
| 3137 |
editing, it is usually easier to use one of the following escape |
editing, it is usually easier to use one of the following escape |
| 3138 |
sequences than the binary character it represents: |
sequences than the binary character it represents: |
| 3139 |
|
|
| 3140 |
\a alarm, that is, the BEL character (hex 07) |
\a alarm, that is, the BEL character (hex 07) |
| 3148 |
\xhh character with hex code hh |
\xhh character with hex code hh |
| 3149 |
\x{hhh..} character with hex code hhh.. |
\x{hhh..} character with hex code hhh.. |
| 3150 |
|
|
| 3151 |
The precise effect of \cx is as follows: if x is a lower case letter, |
The precise effect of \cx is as follows: if x is a lower case letter, |
| 3152 |
it is converted to upper case. Then bit 6 of the character (hex 40) is |
it is converted to upper case. Then bit 6 of the character (hex 40) is |
| 3153 |
inverted. Thus \cz becomes hex 1A, but \c{ becomes hex 3B, while \c; |
inverted. Thus \cz becomes hex 1A, but \c{ becomes hex 3B, while \c; |
| 3154 |
becomes hex 7B. |
becomes hex 7B. |
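
The \cx rule just described can be written out as a short function. This is a sketch rather than PCRE's own code; the name control_escape is invented here, and an ASCII character set is assumed.

```c
#include <assert.h>
#include <ctype.h>

/* Illustrative only: compute the character that \cx stands for, following
   the rule stated above. Assumes an ASCII character set. */
static int control_escape(int x)
{
    assert(x >= 0 && x < 128);
    if (islower(x)) x = toupper(x);   /* lower case letters are upper-cased */
    return x ^ 0x40;                  /* invert bit 6 (hex 40) */
}
```

With this definition, control_escape('z') gives hex 1A, control_escape('{') gives hex 3B, and control_escape(';') gives hex 7B, matching the three cases in the text.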
| 3155 |
|
|
| 3156 |
After \x, from zero to two hexadecimal digits are read (letters can be |
After \x, from zero to two hexadecimal digits are read (letters can be |
| 3157 |
in upper or lower case). Any number of hexadecimal digits may appear |
in upper or lower case). Any number of hexadecimal digits may appear |
| 3158 |
between \x{ and }, but the value of the character code must be less |
between \x{ and }, but the value of the character code must be less |
| 3159 |
than 256 in non-UTF-8 mode, and less than 2**31 in UTF-8 mode. That is, |
than 256 in non-UTF-8 mode, and less than 2**31 in UTF-8 mode. That is, |
| 3160 |
the maximum value in hexadecimal is 7FFFFFFF. Note that this is bigger |
the maximum value in hexadecimal is 7FFFFFFF. Note that this is bigger |
| 3161 |
than the largest Unicode code point, which is 10FFFF. |
than the largest Unicode code point, which is 10FFFF. |
| 3162 |
|
|
| 3163 |
If characters other than hexadecimal digits appear between \x{ and }, |
If characters other than hexadecimal digits appear between \x{ and }, |
| 3164 |
or if there is no terminating }, this form of escape is not recognized. |
or if there is no terminating }, this form of escape is not recognized. |
| 3165 |
Instead, the initial \x will be interpreted as a basic hexadecimal |
Instead, the initial \x will be interpreted as a basic hexadecimal |
| 3166 |
escape, with no following digits, giving a character whose value is |
escape, with no following digits, giving a character whose value is |
| 3167 |
zero. |
zero. |
| 3168 |
|
|
| 3169 |
Characters whose value is less than 256 can be defined by either of the
two syntaxes for \x. There is no difference in the way they are
handled. For example, \xdc is exactly the same as \x{dc}.

After \0 up to two further octal digits are read. If there are fewer
than two digits, just those that are present are used. Thus the
sequence \0\x\07 specifies two binary zeros (the \x, having no
hexadecimal digits after it, also gives a binary zero) followed by a
BEL character (code value 7). Make sure you supply two digits after the
initial zero if the pattern character that follows is itself an octal
digit.

The handling of a backslash followed by a digit other than 0 is
complicated. Outside a character class, PCRE reads it and any following
digits as a decimal number. If the number is less than 10, or if there
have been at least that many previous capturing left parentheses in the
expression, the entire sequence is taken as a back reference. A
description of how this works is given later, following the discussion
of parenthesized subpatterns.

Inside a character class, or if the decimal number is greater than 9
and there have not been that many capturing subpatterns, PCRE re-reads
up to three octal digits following the backslash, and uses them to
generate a data character. Any subsequent digits stand for themselves.
In non-UTF-8 mode, the value of a character specified in octal must be
less than \400. In UTF-8 mode, values up to \777 are permitted. For
example:

  \040   is another way of writing a space
  ...
  \81    is either a back reference, or a binary zero
         followed by the two characters "8" and "1"

Note that octal values of 100 or greater must not be introduced by a
leading zero, because no more than three octal digits are ever read.

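The decision rule for a backslash followed by a non-zero digit can be
modelled in a few lines. The following Python sketch only illustrates
the rule as described above; it is not PCRE's own code, and the
function name and return values are invented for this example.

```python
def classify_backslash_digits(digits, groups_so_far, in_class=False):
    """Model how a backslash followed by decimal digits is read.

    digits        -- the run of decimal digits after the backslash,
                     starting with a digit other than 0
    groups_so_far -- number of capturing left parentheses seen so far
    in_class      -- True if the escape is inside a character class
    """
    number = int(digits)
    if not in_class and (number < 10 or number <= groups_so_far):
        return ("backref", number)

    # Otherwise re-read up to three octal digits as a data character.
    octal = ""
    for ch in digits[:3]:
        if ch not in "01234567":
            break
        octal += ch
    value = int(octal, 8) if octal else 0
    # Any digits that were not consumed stand for themselves.
    return ("char", value, digits[len(octal):])


print(classify_backslash_digits("7", 0))     # ('backref', 7)
print(classify_backslash_digits("81", 0))    # ('char', 0, '81')
print(classify_backslash_digits("81", 81))   # ('backref', 81)
print(classify_backslash_digits("113", 0))   # ('char', 75, '')
```

The second and third calls reproduce the two readings of \81 given in
the example list: a binary zero followed by "8" and "1" when there are
fewer than 81 capturing groups, and a back reference otherwise.
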
All the sequences that define a single character value can be used both
inside and outside character classes. In addition, inside a character
class, the sequence \b is interpreted as the backspace character (hex
08), and the sequences \R and \X are interpreted as the characters "R"
and "X", respectively. Outside a character class, these sequences have
different meanings (see below).

Absolute and relative back references

The sequence \g followed by an unsigned or a negative number,
optionally enclosed in braces, is an absolute or relative back
reference. A named back reference can be coded as \g{name}. Back
references are discussed later, following the discussion of
parenthesized subpatterns.

Absolute and relative subroutine calls

For compatibility with Oniguruma, the non-Perl syntax \g followed by a
name or a number enclosed either in angle brackets or single quotes, is
an alternative syntax for referencing a subpattern as a "subroutine".
Details are discussed later. Note that \g{...} (Perl syntax) and
\g<...> (Oniguruma syntax) are not synonymous. The former is a back
reference; the latter is a subroutine call.

Generic character types

  ...
  \W     any "non-word" character

Each pair of escape sequences partitions the complete set of characters
into two disjoint sets. Any given character matches one, and only one,
of each pair.

These character type sequences can appear both inside and outside
character classes. They each match one character of the appropriate
type. If the current matching point is at the end of the subject
string, all of them fail, since there is no character to match.

For compatibility with Perl, \s does not match the VT character (code
11). This makes it different from the POSIX "space" class. The \s
characters are HT (9), LF (10), FF (12), CR (13), and space (32). If
"use locale;" is included in a Perl script, \s may match the VT
character. In PCRE, it never does.

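The difference from the POSIX "space" class can be checked
mechanically. The Python sketch below is an illustration only: the set
PCRE_S simply restates the list in the paragraph above, and the names
are invented for this example.

```python
import string

# Code points matched by \s, as listed in the text above.
PCRE_S = frozenset({9, 10, 12, 13, 32})

def matches_backslash_s(ch):
    """Return True if ch is one of the \\s characters listed above."""
    return ord(ch) in PCRE_S

# string.whitespace holds the six ASCII whitespace characters, which
# are the members of the POSIX "space" class in the "C" locale.
posix_space = {ord(c) for c in string.whitespace}

print(sorted(posix_space - PCRE_S))   # [11], the VT character
```
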
In UTF-8 mode, characters with values greater than 128 never match \d,
\s, or \w, and always match \D, \S, and \W. This is true even when
Unicode character property support is available. These sequences retain
their original meanings from before UTF-8 support was available, mainly
for efficiency reasons.

The sequences \h, \H, \v, and \V are Perl 5.10 features. In contrast to
the other sequences, these do match certain high-valued codepoints in
UTF-8 mode. The horizontal space characters are:

  U+0009     Horizontal tab
  ...
  U+2029     Paragraph separator

A "word" character is an underscore or any character less than 256 that
is a letter or digit. The definition of letters and digits is
controlled by PCRE's low-valued character tables, and may vary if
locale-specific matching is taking place (see "Locale support" in the
pcreapi page). For example, in a French locale such as "fr_FR" in
Unix-like systems, or "french" in Windows, some character codes greater
than 128 are used for accented letters, and these are matched by \w.
The use of locales with Unicode is discouraged.

Newline sequences

Outside a character class, by default, the escape sequence \R matches
any Unicode newline sequence. This is a Perl 5.10 feature. In non-UTF-8
mode \R is equivalent to the following:

  (?>\r\n|\n|\x0b|\f|\r|\x85)

This is an example of an "atomic group", details of which are given
below. This particular group matches either the two-character sequence
CR followed by LF, or one of the single characters LF (linefeed,
U+000A), VT (vertical tab, U+000B), FF (formfeed, U+000C), CR (carriage
return, U+000D), or NEL (next line, U+0085). The two-character sequence
is treated as a single unit that cannot be split.

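The effect of the atomic group can be seen by comparing it with an
ordinary group. The sketch below uses Python's re module, which is a
different engine from PCRE and has no \R. Because atomic groups in re
need a recent Python version, the example emulates one with the common
lookahead-plus-back-reference idiom. All names are illustrative.

```python
import re

ALTERNATIVES = r"\r\n|\n|\x0b|\f|\r|\x85"

# Emulated atomic group: once the lookahead has captured a newline
# sequence, the back reference consumes exactly that text and the
# engine cannot go back and try a shorter alternative.
ATOMIC = r"(?=(" + ALTERNATIVES + r"))\1"

# Ordinary group: backtracking may split CR LF into CR and LF.
PLAIN = r"(?:" + ALTERNATIVES + r")"

subject = "\r\n"

# A newline sequence followed by a further LF.
print(re.search(ATOMIC + r"\n", subject))               # None
print(re.search(PLAIN + r"\n", subject) is not None)    # True
```

With the ordinary group, the engine first tries CR LF, fails to find a
further LF, and then backtracks to match CR alone followed by LF. The
atomic form refuses that second attempt, which is the "cannot be split"
behaviour described above.
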
In UTF-8 mode, two additional characters whose codepoints are greater
than 255 are added: LS (line separator, U+2028) and PS (paragraph
separator, U+2029). Unicode character property support is not needed
for these characters to be recognized.

It is possible to restrict \R to match only CR, LF, or CRLF (instead of
the complete set of Unicode line endings) by setting the option
PCRE_BSR_ANYCRLF either at compile time or when the pattern is matched.
(BSR is an abbreviation for "backslash R".) This can be made the
default when PCRE is built; if this is the case, the other behaviour
can be requested via the PCRE_BSR_UNICODE option. It is also possible
to specify these settings by starting a pattern string with one of the
following sequences:

  (*BSR_ANYCRLF)   CR, LF, or CRLF only
  ...

These override the default and the options given to pcre_compile(), but
they can be overridden by options given to pcre_exec(). Note that these
special settings, which are not Perl-compatible, are recognized only at
the very start of a pattern, and that they must be in upper case. If
more than one of them is present, the last one is used. They can be
combined with a change of newline convention, for example, a pattern
can start with:

  (*ANY)(*BSR_ANYCRLF)

...

Unicode character properties

When PCRE is built with Unicode character property support, three
additional escape sequences that match characters with specific
properties are available. When not in UTF-8 mode, these sequences are
of course limited to testing characters whose codepoints are less than
256, but they do work in this mode. The extra escape sequences are:

  \p{xx}   a character with the xx property
  \P{xx}   a character without the xx property
  \X       an extended Unicode sequence

The property names represented by xx above are limited to the Unicode
script names, the general category properties, and "Any", which matches
any character (including newline). Other properties such as
"InMusicalSymbols" are not currently supported by PCRE. Note that
\P{Any} does not match any characters, so always causes a match
failure.

Sets of Unicode characters are defined as belonging to certain scripts.
A character from one of these sets can be matched using a script name.
For example:

  \p{Greek}
  \P{Han}

Those that are not part of an identified script are lumped together as
"Common". The current list of scripts is:

Arabic, Armenian, Balinese, Bengali, Bopomofo, Braille, Buginese,
Buhid, Canadian_Aboriginal, Cherokee, Common, Coptic, Cuneiform,
Cypriot, Cyrillic, Deseret, Devanagari, Ethiopic, Georgian, Glagolitic,
Gothic, Greek, Gujarati, Gurmukhi, Han, Hangul, Hanunoo, Hebrew,
Hiragana, Inherited, Kannada, Katakana, Kharoshthi, Khmer, Lao, Latin,
Limbu, Linear_B, Malayalam, Mongolian, Myanmar, New_Tai_Lue, Nko,
Ogham, Old_Italic, Old_Persian, Oriya, Osmanya, Phags_Pa, Phoenician,
Runic, Shavian, Sinhala, Syloti_Nagri, Syriac, Tagalog, Tagbanwa,
Tai_Le, Tamil, Telugu, Thaana, Thai, Tibetan, Tifinagh, Ugaritic, Yi.

Each character has exactly one general category property, specified by
a two-letter abbreviation. For compatibility with Perl, negation can be
specified by including a circumflex between the opening brace and the
property name. For example, \p{^Lu} is the same as \P{Lu}.

If only one letter is specified with \p or \P, it includes all the
general category properties that start with that letter. In this case,
in the absence of negation, the curly brackets in the escape sequence
are optional; these two examples have the same effect:

  \p{L}
  ...
  Zp    Paragraph separator
  Zs    Space separator

The special property L& is also supported: it matches a character that
has the Lu, Ll, or Lt property, in other words, a letter that is not
classified as a modifier or "other".

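Python's standard unicodedata module reports the same two-letter
general category codes, so it can be used to explore which characters
a given \p{..} item would select. This is only an aid to understanding:
Python's Unicode tables are usually newer than those built into a given
PCRE release, so results may differ for some characters. The helper
name has_property is invented for this sketch, and it models general
category properties only, not script names.

```python
import unicodedata

def has_property(ch, prop):
    """Model \\p{prop} for general category properties only.

    prop may be a two-letter category such as "Lu", a single letter
    such as "L" (any category starting with that letter), or the
    special name "L&" (Lu, Ll, or Lt).
    """
    category = unicodedata.category(ch)
    if prop == "L&":
        return category in ("Lu", "Ll", "Lt")
    if len(prop) == 1:
        return category.startswith(prop)
    return category == prop

print(has_property("A", "Lu"))        # True
print(has_property("a", "Lu"))        # False
print(has_property("a", "L"))         # True
print(has_property("\u01c5", "L&"))   # True: a titlecase letter (Lt)
print(has_property("\u02b0", "L&"))   # False: a modifier letter (Lm)
```
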
The Cs (Surrogate) property applies only to characters in the range
U+D800 to U+DFFF. Such characters are not valid in UTF-8 strings (see
RFC 3629) and so cannot be tested by PCRE, unless UTF-8 validity
checking has been turned off (see the discussion of PCRE_NO_UTF8_CHECK
in the pcreapi page).

The long synonyms for these properties that Perl supports (such as
\p{Letter}) are not supported by PCRE, nor is it permitted to prefix
any of these properties with "Is".

No character that is in the Unicode table has the Cn (unassigned)
property. Instead, this property is assumed for any code point that is
not in the Unicode table.

Specifying caseless matching does not affect these escape sequences.
For example, \p{Lu} always matches only upper case letters.

The \X escape matches any number of Unicode characters that form an
extended Unicode sequence. \X is equivalent to

  (?>\PM\pM*)

That is, it matches a character without the "mark" property, followed
by zero or more characters with the "mark" property, and treats the
sequence as an atomic group (see below). Characters with the "mark"
property are typically accents that affect the preceding character.
None of them have codepoints less than 256, so in non-UTF-8 mode \X
matches any one character.

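The equivalence (?>\PM\pM*) can be modelled directly. The Python sketch
below splits a string the way successive \X matches would, using the
standard unicodedata module to decide which characters have a "mark"
category. It is a model of the stated equivalence rather than PCRE
code, Python's Unicode tables may be newer than PCRE's, and the
function names are invented for this example.

```python
import unicodedata

def is_mark(ch):
    """True if ch has a general category starting with M."""
    return unicodedata.category(ch).startswith("M")

def extended_sequences(text):
    """Split text the way repeated \\X matches would, per (?>\\PM\\pM*).

    A leading mark with no base character is not matched by \\X, so
    such characters are skipped here.
    """
    result = []
    i = 0
    while i < len(text):
        if is_mark(text[i]):
            i += 1            # \PM cannot match here; move on
            continue
        j = i + 1
        while j < len(text) and is_mark(text[j]):
            j += 1
        result.append(text[i:j])
        i = j
    return result

# "e" + COMBINING ACUTE ACCENT, then a plain "a"
print(extended_sequences("e\u0301a") == ["e\u0301", "a"])   # True
```
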
Matching characters by Unicode property is not fast, because PCRE has
to search a structure that contains data for over fifteen thousand
characters. That is why the traditional escape sequences such as \d and
\w do not use Unicode properties in PCRE.

Resetting the match start

The escape sequence \K, which is a Perl 5.10 feature, causes any
previously matched characters not to be included in the final matched
sequence. For example, the pattern:

  foo\Kbar

matches "foobar", but reports that it has matched "bar". This feature
is similar to a lookbehind assertion (described below). However, in
this case, the part of the subject before the real match does not have
to be of fixed length, as lookbehind assertions do. The use of \K does
not interfere with the setting of captured substrings. For example,
when the pattern

  (foo)\Kbar

...

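Python's re module has no \K, but the reported match can be imitated
with a capturing group, and the contrast with lookbehind is easy to
show there. The sketch below is an emulation in a different engine, not
PCRE behaviour itself; the pattern fo+\Kbar mentioned in the comments
is a hypothetical variation on the foo\Kbar example above.

```python
import re

# Emulate foo\Kbar: match "foobar" but report only the part after \K.
m = re.search(r"foo(bar)", "xxfoobar")
print(m.group(1), m.span(1))          # bar (5, 8)

# A variable-length prefix, as in fo+\Kbar, is handled the same way.
m = re.search(r"fo+(bar)", "xxfooooobar")
print(m.group(1), m.span(1))          # bar (8, 11)

# The equivalent lookbehind is rejected because it is not fixed length.
try:
    re.compile(r"(?<=fo+)bar")
except re.error as exc:
    print("lookbehind rejected:", exc)
```
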
Simple assertions

The final use of backslash is for certain simple assertions. An
assertion specifies a condition that has to be met at a particular
point in a match, without consuming any characters from the subject
string. The use of subpatterns for more complicated assertions is
described below. The backslashed assertions are:

  \b     matches at a word boundary
  ...
  \z     matches only at the end of the subject
  \G     matches at the first matching position in the subject

These assertions may not appear in character classes (but note that \b
has a different meaning, namely the backspace character, inside a
character class).

A word boundary is a position in the subject string where the current
character and the previous character do not both match \w or \W (i.e.
one matches \w and the other matches \W), or the start or end of the
string if the first or last character matches \w, respectively.

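The word boundary definition translates almost word for word into
code. The Python sketch below applies it to an ASCII subject and
compares the result with \b in Python's re module, which is a different
engine but uses the same definition for this input. The helper names
are invented for this example, and is_word_char is an ASCII-only
stand-in for \w that ignores locale-specific tables.

```python
import re

def is_word_char(ch):
    """ASCII-only stand-in for \\w: a letter, a digit, or underscore."""
    return ch.isascii() and (ch.isalnum() or ch == "_")

def is_word_boundary(subject, pos):
    """Apply the definition above at position pos (0..len(subject))."""
    before = pos > 0 and is_word_char(subject[pos - 1])
    after = pos < len(subject) and is_word_char(subject[pos])
    return before != after

subject = "ab cd"
modelled = [i for i in range(len(subject) + 1)
            if is_word_boundary(subject, i)]
print(modelled)                                            # [0, 2, 3, 5]

# Python's \b follows the same definition for ASCII text.
print([m.start() for m in re.finditer(r"\b", subject)])    # [0, 2, 3, 5]
```
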
The \A, \Z, and \z assertions differ from the traditional circumflex
and dollar (described in the next section) in that they only ever match
at the very start and end of the subject string, whatever options are
set. Thus, they are independent of multiline mode. These three
assertions are not affected by the PCRE_NOTBOL or PCRE_NOTEOL options,
which affect only the behaviour of the circumflex and dollar
metacharacters. However, if the startoffset argument of pcre_exec() is
non-zero, indicating that matching is to start at a point other than
the beginning of the subject, \A can never match. The difference
between \Z and \z is that \Z matches before a newline at the end of the
string as well as at the very end, whereas \z matches only at the end.

The \G assertion is true only when the current matching position is at
the start point of the match, as specified by the startoffset argument
of pcre_exec(). It differs from \A when the value of startoffset is
non-zero. By calling pcre_exec() multiple times with appropriate
arguments, you can mimic Perl's /g option, and it is in this kind of
implementation where \G can be useful.

Note, however, that PCRE's interpretation of \G, as the start of the
current match, is subtly different from Perl's, which defines it as the
end of the previous match. In Perl, these can be different when the
previously matched string was empty. Because PCRE does just one match
at a time, it cannot reproduce this behaviour.

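The repeated-call technique can be sketched in Python, whose re module
is a different engine with no \G. There, Pattern.match(string, pos)
succeeds only if the match begins exactly at pos, which plays the role
of a pattern that starts with \G combined with a startoffset of pos.
The token pattern and the function name are invented for this sketch.

```python
import re

TOKEN = re.compile(r"\d+|[a-z]+")

def tokens(subject):
    """Collect adjacent tokens, stopping at the first gap."""
    found = []
    pos = 0
    while pos < len(subject):
        m = TOKEN.match(subject, pos)   # anchored at pos, like \G
        if m is None:
            break
        found.append(m.group())
        pos = m.end()                   # next call starts here
    return found

print(tokens("abc123def"))    # ['abc', '123', 'def']
print(tokens("abc 123"))      # ['abc']: the space breaks the chain

# With a non-zero start position, \A cannot match at all.
print(re.compile(r"\Aabc").search("xxabc", 2))   # None
```

Because the token pattern never matches an empty string, each call
advances the position, so the difference from Perl noted above for
empty matches does not arise in this example.
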
If all the alternatives of a pattern begin with \G, the expression is
anchored to the starting match position, and the "anchored" flag is set
in the compiled regular expression.


CIRCUMFLEX AND DOLLAR

Outside a character class, in the default matching mode, the circumflex
character is an assertion that is true only if the current matching
point is at the start of the subject string. If the startoffset
argument of pcre_exec() is non-zero, circumflex can never match if the
PCRE_MULTILINE option is unset. Inside a character class, circumflex
has an entirely different meaning (see below).

Circumflex need not be the first character of the pattern if a number
of alternatives are involved, but it should be the first thing in each
alternative in which it appears if the pattern is ever to match that
branch. If all possible alternatives start with a circumflex, that is,
if the pattern is constrained to match only at the start of the
subject, it is said to be an "anchored" pattern. (There are also other
constructs that can cause a pattern to be anchored.)

A dollar character is an assertion that is true only if the current
matching point is at the end of the subject string, or immediately
before a newline at the end of the string (by default). Dollar need not
be the last character of the pattern if a number of alternatives are
involved, but it should be the last item in any branch in which it
appears. Dollar has no special meaning in a character class.

The meaning of dollar can be changed so that it matches only at the
very end of the string, by setting the PCRE_DOLLAR_ENDONLY option at
compile time. This does not affect the \Z assertion.

The meanings of the circumflex and dollar characters are changed if the
PCRE_MULTILINE option is set. When this is the case, a circumflex
matches immediately after internal newlines as well as at the start of
the subject string. It does not match after a newline that ends the
string. A dollar matches before any newlines in the string, as well as
at the very end, when PCRE_MULTILINE is set. When newline is specified
as the two-character sequence CRLF, isolated CR and LF characters do
not indicate newlines.

For example, the pattern /^abc$/ matches the subject string "def\nabc" |
For example, the pattern /^abc$/ matches the subject string "def\nabc" |
| 3607 |
(where \n represents a newline) in multiline mode, but not otherwise. |
(where \n represents a newline) in multiline mode, but not otherwise. |
| 3608 |
Consequently, patterns that are anchored in single line mode because |
Consequently, patterns that are anchored in single line mode because |
| 3609 |
all branches start with ^ are not anchored in multiline mode, and a |
all branches start with ^ are not anchored in multiline mode, and a |
| 3610 |
match for circumflex is possible when the startoffset argument of |
match for circumflex is possible when the startoffset argument of |
| 3611 |
pcre_exec() is non-zero. The PCRE_DOLLAR_ENDONLY option is ignored if |
pcre_exec() is non-zero. The PCRE_DOLLAR_ENDONLY option is ignored if |
| 3612 |
PCRE_MULTILINE is set. |
PCRE_MULTILINE is set. |
| 3613 |
|
|
| 3614 |
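The /^abc$/ example above can be reproduced in a short sketch. It uses
Python's re module as a stand-in for PCRE, with re.MULTILINE assumed to
play the part of PCRE_MULTILINE; it does not call pcre_exec().

```python
import re

subject = "def\nabc"

# Single-line (default) mode: "^" matches only at the start of the subject.
single = re.search(r"^abc$", subject)

# Multiline mode: "^" also matches immediately after an internal newline.
multi = re.search(r"^abc$", subject, re.MULTILINE)

# \A stays tied to the start of the subject, even in multiline mode.
anchored = re.search(r"\Aabc", subject, re.MULTILINE)

print(single)         # None
print(multi.span())   # (4, 7)
print(anchored)       # None
```
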
Note that the sequences \A, \Z, and \z can be used to match the start
and end of the subject in both modes, and if all branches of a pattern
start with \A it is always anchored, whether or not PCRE_MULTILINE is
set.


FULL STOP (PERIOD, DOT)

Outside a character class, a dot in the pattern matches any one
character in the subject string except (by default) a character that
signifies the end of a line. In UTF-8 mode, the matched character may be
more than one byte long.

When a line ending is defined as a single character, dot never matches
that character; when the two-character sequence CRLF is used, dot does
not match CR if it is immediately followed by LF, but otherwise it
matches all characters (including isolated CRs and LFs). When any
Unicode line endings are being recognized, dot does not match CR or LF
or any of the other line ending characters.

The behaviour of dot with regard to newlines can be changed. If the
PCRE_DOTALL option is set, a dot matches any one character, without
exception. If the two-character sequence CRLF is present in the subject
string, it takes two dots to match it.

The handling of dot is entirely independent of the handling of
circumflex and dollar, the only relationship being that they both
involve newlines. Dot has no special meaning in a character class.


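A minimal sketch of the dot rules above, using Python's re module as a
stand-in for PCRE. It assumes the newline convention is a single LF and
that re.DOTALL corresponds to PCRE_DOTALL; the subjects are invented.

```python
import re

# By default a dot does not match the newline character.
default_dot = re.fullmatch(r"a.b", "a\nb")

# With DOTALL set, a dot matches any one character, without exception.
dotall_dot = re.fullmatch(r"a.b", "a\nb", re.DOTALL)

# CRLF is two characters, so one dot is not enough even with DOTALL.
crlf_one_dot = re.fullmatch(r"a.b", "a\r\nb", re.DOTALL)
crlf_two_dots = re.fullmatch(r"a..b", "a\r\nb", re.DOTALL)

# Inside a character class, a dot is a literal full stop.
class_dot_other = re.fullmatch(r"a[.]b", "axb")
class_dot_literal = re.fullmatch(r"a[.]b", "a.b")

print(default_dot, crlf_one_dot, class_dot_other)   # None None None
print(dotall_dot is not None, crlf_two_dots is not None,
      class_dot_literal is not None)                # True True True
```
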
MATCHING A SINGLE BYTE

Outside a character class, the escape sequence \C matches any one byte,
both in and out of UTF-8 mode. Unlike a dot, it always matches any
line-ending characters. The feature is provided in Perl in order to
match individual bytes in UTF-8 mode. Because it breaks up UTF-8
characters into individual bytes, what remains in the string may be a
malformed UTF-8 string. For this reason, the \C escape sequence is best
avoided.

PCRE does not allow \C to appear in lookbehind assertions (described
below), because in UTF-8 mode this would make it impossible to
calculate the length of the lookbehind.


An opening square bracket introduces a character class, terminated by a
closing square bracket. A closing square bracket on its own is not
special. If a closing square bracket is required as a member of the
class, it should be the first data character in the class (after an
initial circumflex, if present) or escaped with a backslash.

A character class matches a single character in the subject. In UTF-8
mode, the character may occupy more than one byte. A matched character
must be in the set of characters defined by the class, unless the first
character in the class definition is a circumflex, in which case the
subject character must not be in the set defined by the class. If a
circumflex is actually required as a member of the class, ensure it is
not the first character, or escape it with a backslash.

For example, the character class [aeiou] matches any lower case vowel,
while [^aeiou] matches any character that is not a lower case vowel.
Note that a circumflex is just a convenient notation for specifying the
characters that are in the class by enumerating those that are not. A
class that starts with a circumflex is not an assertion: it still
consumes a character from the subject string, and therefore it fails if
the current pointer is at the end of the string.

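The class rules above can be tried out in a short sketch. It uses
Python's re module as a stand-in for PCRE, on the assumption that the
two agree on these basic rules; the subject strings are invented.

```python
import re

# [aeiou] matches a lower case vowel; [^aeiou] matches anything else.
vowel = re.search(r"[aeiou]", "rhythm and blues")
non_vowel = re.search(r"[^aeiou]", "aeioux")

# A negated class still consumes a character, so it fails at the end
# of the subject: nothing follows the final "q" in "Iraq".
at_end = re.search(r"q[^u]", "Iraq")
not_at_end = re.search(r"q[^u]", "Iraqi")

# "]" is literal as the first data character (also after an initial
# "^"); "^" is literal when it is not the first character.
bracket = re.fullmatch(r"[]a]", "]")
negated_bracket = re.fullmatch(r"[^]a]", "b")
caret = re.fullmatch(r"[a^]", "^")

print(vowel.group(), non_vowel.group())   # a x
print(at_end, not_at_end.group())         # None qi
print(bracket is not None, negated_bracket is not None,
      caret is not None)                  # True True True
```
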
In UTF-8 mode, characters with values greater than 255 can be included
in a class as a literal string of bytes, or by using the \x{ escaping
mechanism.

When caseless matching is set, any letters in a class represent both
their upper case and lower case versions, so for example, a caseless
[aeiou] matches "A" as well as "a", and a caseless [^aeiou] does not
match "A", whereas a caseful version would. In UTF-8 mode, PCRE always
understands the concept of case for characters whose values are less
than 128, so caseless matching is always possible. For characters with
higher values, the concept of case is supported if PCRE is compiled
with Unicode property support, but not otherwise. If you want to use
caseless matching for characters 128 and above, you must ensure that
PCRE is compiled with Unicode property support as well as with UTF-8
support.

Characters that might indicate line breaks are never treated in any
special way when matching character classes, whatever line-ending
sequence is in use, and whatever setting of the PCRE_DOTALL and
PCRE_MULTILINE options is used. A class such as [^a] always matches one
of these characters.

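The caseless and line-break rules above can be sketched with ASCII
subjects. Python's re module is used here as a stand-in for PCRE, with
re.IGNORECASE assumed to correspond to PCRE_CASELESS; nothing in the
sketch exercises UTF-8 or Unicode property support.

```python
import re

# Caseless matching: letters in a class stand for both cases.
caseless_pos = re.fullmatch(r"[aeiou]", "A", re.IGNORECASE)
caseless_neg = re.fullmatch(r"[^aeiou]", "A", re.IGNORECASE)
caseful_neg = re.fullmatch(r"[^aeiou]", "A")

# Line-break characters get no special treatment inside a class,
# whatever option is set.
newline_default = re.fullmatch(r"[^a]", "\n")
cr_multiline = re.fullmatch(r"[^a]", "\r", re.MULTILINE)

print(caseless_pos is not None)   # True
print(caseless_neg)               # None
print(caseful_neg is not None)    # True
print(newline_default is not None, cr_multiline is not None)  # True True
```
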
The minus (hyphen) character can be used to specify a range of
characters in a character class. For example, [d-m] matches any letter
between d and m, inclusive. If a minus character is required in a
class, it must be escaped with a backslash or appear in a position
where it cannot be interpreted as indicating a range, typically as the
first or last character in the class.

It is not possible to have the literal character "]" as the end
character of a range. A pattern such as [W-]46] is interpreted as a
class of two characters ("W" and "-") followed by a literal string
"46]", so it would match "W46]" or "-46]". However, if the "]" is
escaped with a backslash it is interpreted as the end of range, so
[W-\]46] is interpreted as a class containing a range followed by two
other characters. The octal or hexadecimal representation of "]" can
also be used to end a range.

Ranges operate in the collating sequence of character values. They can
also be used for characters specified numerically, for example
[\000-\037]. In UTF-8 mode, ranges can include characters whose values
are greater than 255, for example [\x{100}-\x{2ff}].

If a range that includes letters is used when caseless matching is set,
it matches the letters in either case. For example, [W-c] is equivalent
to [][\\^_`wxyzabc], matched caselessly, and in non-UTF-8 mode, if
character tables for a French locale are in use, [\xc8-\xcb] matches
accented E characters in both cases. In UTF-8 mode, PCRE supports the
concept of case for characters with values of 128 and above only when
it is compiled with Unicode property support.

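The range rules above can be checked with a small sketch. It uses
Python's re module as a stand-in for PCRE; in_class() is a helper
invented for this illustration, and the \x{...} and locale examples are
not exercised because they depend on PCRE-specific syntax and tables.

```python
import re

def in_class(pattern, text, flags=0):
    """Return True if the whole of text matches pattern."""
    return re.fullmatch(pattern, text, flags) is not None

checks = {
    # [d-m] is a range; a hyphen placed last or escaped is a literal.
    "range": in_class(r"[d-m]", "g") and not in_class(r"[d-m]", "n"),
    "hyphen_last": in_class(r"[dm-]", "-") and not in_class(r"[dm-]", "g"),
    "hyphen_escaped": in_class(r"[d\-m]", "-") and not in_class(r"[d\-m]", "g"),
    # [W-]46] is the class [W-] followed by the literal string "46]".
    "unescaped_bracket": in_class(r"[W-]46]", "W46]")
                         and in_class(r"[W-]46]", "-46]"),
    # [W-\]46] is one class: the range W to ] plus "4" and "6".
    "escaped_bracket": all(in_class(r"[W-\]46]", c) for c in "WXZ]46"),
    # A numeric range: the control characters 0 to 31.
    "numeric": in_class(r"[\000-\037]", "\t")
               and not in_class(r"[\000-\037]", " "),
    # Caseless [W-c] also matches w, x, y, z and A, B, C.
    "caseless": all(in_class(r"[W-c]", c, re.IGNORECASE)
                    for c in "wxyzABC[]^_`\\")
                and not in_class(r"[W-c]", "d", re.IGNORECASE),
}
print(all(checks.values()))   # True
```
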
The character types \d, \D, \p, \P, \s, \S, \w, and \W may also appear
in a character class, and add the characters that they match to the
class. For example, [\dABCDEF] matches any hexadecimal digit. A
circumflex can conveniently be used with the upper case character types
to specify a more restricted set of characters than the matching lower
case type. For example, the class [^\W_] matches any letter or digit,
but not underscore.

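The two classes mentioned above can be sketched as follows, using
Python's re module as a stand-in for PCRE. The re.ASCII flag is an
assumption made here so that \d and \W cover only ASCII characters, as
they do with PCRE's default tables.

```python
import re

# \d adds the digits to the class, giving the upper case hex digits.
hex_digit = re.compile(r"[\dABCDEF]", re.ASCII)

# Negating \W and "_" together leaves the letters and digits.
letter_or_digit = re.compile(r"[^\W_]", re.ASCII)

hex_results = {c: bool(hex_digit.fullmatch(c)) for c in "7CcG"}
word_results = {c: bool(letter_or_digit.fullmatch(c)) for c in "aZ5_-"}

print(hex_results)   # {'7': True, 'C': True, 'c': False, 'G': False}
print(word_results)  # {'a': True, 'Z': True, '5': True, '_': False, '-': False}
```
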
The only metacharacters that are recognized in character classes are
backslash, hyphen (only where it can be interpreted as specifying a
range), circumflex (only at the start), opening square bracket (only
when it can be interpreted as introducing a POSIX class name - see the
next section), and the terminating closing square bracket. However,
escaping other non-alphanumeric characters does no harm.


POSIX CHARACTER CLASSES

Perl supports the POSIX notation for character classes. This uses names
enclosed by [: and :] within the enclosing square brackets. PCRE also
supports this notation. For example,

  [01[:alpha:]%]

  word     "word" characters (same as \w)
  xdigit   hexadecimal digits

The "space" characters are HT (9), LF (10), VT (11), FF (12), CR (13),
and space (32). Notice that this list includes the VT character (code
11). This makes "space" different to \s, which does not include VT (for
Perl compatibility).

The name "word" is a Perl extension, and "blank" is a GNU extension
from Perl 5.8. Another Perl extension is negation, which is indicated
by a ^ character after the colon. For example,

  [12[:^digit:]]

matches "1", "2", or any non-digit. PCRE (and Perl) also recognize the
POSIX syntax [.ch.] and [=ch=] where "ch" is a "collating element", but
these are not supported, and an error is given if they are encountered.

string). The matching process tries each alternative in turn, from left
to right, and the first one that succeeds is used. If the alternatives
are within a subpattern (defined below), "succeeds" means matching the
rest of the main pattern as well as the alternative in the subpattern.


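The left-to-right rule for alternatives can be seen in a short sketch.
It uses Python's re module as a stand-in for PCRE, on the assumption
that both try alternatives in the same order; the names in the subject
strings are invented.

```python
import re

# Alternatives are tried from left to right; the first that succeeds wins.
short_first = re.match(r"gil|gilbert", "gilbert")
long_first = re.match(r"gilbert|gil", "gilbert")

# Inside a subpattern, "succeeds" includes matching the rest of the
# pattern, so "gil" is given up when " smith" cannot follow it.
in_group = re.match(r"(gil|gilbert) smith", "gilbert smith")

print(short_first.group())   # gil
print(long_first.group())    # gilbert
print(in_group.group(1))     # gilbert
```
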
INTERNAL OPTION SETTING

The settings of the PCRE_CASELESS, PCRE_MULTILINE, PCRE_DOTALL, and
PCRE_EXTENDED options (which are Perl-compatible) can be changed from
within the pattern by a sequence of Perl option letters enclosed
between "(?" and ")". The option letters are

  i  for PCRE_CASELESS
  m  for PCRE_MULTILINE
  s  for PCRE_DOTALL
  x  for PCRE_EXTENDED

For example, (?im) sets caseless, multiline matching. It is also
possible to unset these options by preceding the letter with a hyphen,
and a combined setting and unsetting such as (?im-sx), which sets
PCRE_CASELESS and PCRE_MULTILINE while unsetting PCRE_DOTALL and
PCRE_EXTENDED, is also permitted. If a letter appears both before and
after the hyphen, the option is unset.

The PCRE-specific options PCRE_DUPNAMES, PCRE_UNGREEDY, and PCRE_EXTRA
can be changed in the same way as the Perl-compatible options by using
the characters J, U and X respectively.

When an option change occurs at top level (that is, not inside
subpattern parentheses), the change applies to the remainder of the
pattern that follows. If the change is placed right at the start of a
pattern, PCRE extracts it into the global options (and it will
therefore show up in data extracted by the pcre_fullinfo() function).

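An option setting placed at the start of a pattern can be sketched as
follows. Python's re module is used as a stand-in for PCRE: it accepts
the same (?im) notation at the start of a pattern and, loosely like the
data returned by pcre_fullinfo(), reports the resulting options in the
flags attribute of the compiled pattern. The hyphenated unsetting form
and the letters J, U and X are not exercised here.

```python
import re

subject = "def\nABC"

# Without options, neither the case nor the position is acceptable.
plain = re.search(r"^abc$", subject)

# (?im) at the start sets caseless, multiline matching for the pattern.
with_options = re.search(r"(?im)^abc$", subject)

# The leading setting shows up in the compiled pattern's options.
compiled = re.compile(r"(?im)^abc$")
has_caseless = bool(compiled.flags & re.IGNORECASE)
has_multiline = bool(compiled.flags & re.MULTILINE)

print(plain)                         # None
print(with_options.group())          # ABC
print(has_caseless, has_multiline)   # True True
```
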
An option change within a subpattern (see below for a description of
subpatterns) affects only that part of the current pattern that follows
it, so

  (a(?i)b)c

matches abc and aBc and no other strings (assuming PCRE_CASELESS is not
used). By this means, options can be made to have different settings
in different parts of the pattern. Any changes made in one alternative
do carry on into subsequent branches within the same subpattern. For
example,

  (a(?i)b|c)

matches "ab", "aB", "c", and "C", even though when matching "C" the
first branch is abandoned before the option setting. This is because
the effects of option settings happen at compile time. There would be
some very weird behaviour otherwise.

Note: There are other PCRE-specific options that can be set by the
application when the compile or match functions are called. In some
cases the pattern can contain special leading sequences to override
what the application has set or what has been defaulted. Details are
given in the section entitled "Newline sequences" above.


  cat(aract|erpillar|)

matches one of the words "cat", "cataract", or "caterpillar". Without
the parentheses, it would match "cataract", "erpillar" or an empty
string.

2. It sets up the subpattern as a capturing subpattern. This means
that, when the whole pattern matches, that portion of the subject
string that matched the subpattern is passed back to the caller via the
ovector argument of pcre_exec(). Opening parentheses are counted from
left to right (starting from 1) to obtain numbers for the capturing
subpatterns.

For example, if the string "the red king" is matched against the
pattern

  the ((red|white) (king|queen))

the captured substrings are "red king", "red", and "king", and are
numbered 1, 2, and 3, respectively.

The fact that plain parentheses fulfil two functions is not always
helpful. There are often times when a grouping subpattern is required
without a capturing requirement. If an opening parenthesis is followed
by a question mark and a colon, the subpattern does not do any
capturing, and is not counted when computing the number of any
subsequent capturing subpatterns. For example, if the string "the white
queen" is matched against the pattern

  the ((?:red|white) (king|queen))

the captured substrings are "white queen" and "queen", and are numbered
1 and 2. The maximum number of capturing subpatterns is 65535.

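The grouping and capturing examples above can be run as a short sketch.
It uses Python's re module as a stand-in for PCRE, with the groups()
method of a match object taking the place of the ovector argument of
pcre_exec(); that mapping is an assumption made for illustration.

```python
import re

# Grouping: the parentheses localize the alternatives after "cat".
grouped = re.compile(r"cat(aract|erpillar|)")
words = [w for w in ("cat", "cataract", "caterpillar", "erpillar")
         if grouped.fullmatch(w)]

# Capturing: opening parentheses are counted from left to right.
capturing = re.fullmatch(r"the ((red|white) (king|queen))", "the red king")

# (?: ... ) groups without capturing and is not counted.
non_capturing = re.fullmatch(r"the ((?:red|white) (king|queen))",
                             "the white queen")

print(words)                    # ['cat', 'cataract', 'caterpillar']
print(capturing.groups())       # ('red king', 'red', 'king')
print(non_capturing.groups())   # ('white queen', 'queen')
```
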
As a convenient shorthand, if any option settings are required at the
start of a non-capturing subpattern, the option letters may appear
between the "?" and the ":". Thus the two patterns

  (?i:saturday|sunday)
  (?:(?i)saturday|sunday)

match exactly the same set of strings. Because alternative branches are
tried from left to right, and options are not reset until the end of
the subpattern is reached, an option setting in one branch does affect
subsequent branches, so the above patterns match "SUNDAY" as well as
"Saturday".


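The shorthand form can be sketched with Python's re module as a
stand-in for PCRE. Only the (?i:...) form is shown: recent Python
releases reject an option setting such as (?i) that is not at the
start of the pattern, so the second pattern above is not reproduced.
The "(?i:sat)urday" pattern is an extra illustration, not taken from
the text.

```python
import re

# The option letter between "?" and ":" applies to every branch.
weekend = re.compile(r"(?i:saturday|sunday)")
matches = [day for day in ("Saturday", "SUNDAY", "monday")
           if weekend.fullmatch(day)]

# The option applies only inside the group that sets it.
scoped = re.compile(r"(?i:sat)urday")
inside_scope = scoped.fullmatch("SATurday")
outside_scope = scoped.fullmatch("SATURDAY")

print(matches)                    # ['Saturday', 'SUNDAY']
print(inside_scope is not None)   # True
print(outside_scope)              # None
```
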
DUPLICATE SUBPATTERN NUMBERS

Perl 5.10 introduced a feature whereby each alternative in a subpattern
uses the same numbers for its capturing parentheses. Such a subpattern
starts with (?| and is itself a non-capturing subpattern. For example,
consider this pattern:

  (?|(Sat)ur|(Sun))day

Because the two alternatives are inside a (?| group, both sets of
capturing parentheses are numbered one. Thus, when the pattern matches,
you can look at captured substring number one, whichever alternative
matched. This construct is useful when you want to capture part, but
not all, of one of a number of alternatives. Inside a (?| group,
parentheses are numbered as usual, but the number is reset at the start
of each branch. The numbers of any capturing buffers that follow the
subpattern start after the highest number used in any branch. The
following example is taken from the Perl documentation. The numbers
underneath show in which buffer the captured content will be stored.

  # before ---------------branch-reset----------- after
  / ( a ) (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x
  # 1           2         2  3        2     3     4

A backreference or a recursive call to a numbered subpattern always |
A backreference or a recursive call to a numbered subpattern always |
| 3945 |
refers to the first one in the pattern with the given number. |
refers to the first one in the pattern with the given number. |
| 3946 |
|
|
| 3947 |
An alternative approach to using this "branch reset" feature is to use |
An alternative approach to using this "branch reset" feature is to use |
| 3948 |
duplicate named subpatterns, as described in the next section. |
duplicate named subpatterns, as described in the next section. |
| 3949 |
|
|
| 3950 |
|
|
| 3951 |
NAMED SUBPATTERNS |
NAMED SUBPATTERNS |
| 3952 |
|
|
| 3953 |
Identifying capturing parentheses by number is simple, but it can be |
Identifying capturing parentheses by number is simple, but it can be |
| 3954 |
very hard to keep track of the numbers in complicated regular expres- |
very hard to keep track of the numbers in complicated regular expres- |
| 3955 |
sions. Furthermore, if an expression is modified, the numbers may |
sions. Furthermore, if an expression is modified, the numbers may |
| 3956 |
change. To help with this difficulty, PCRE supports the naming of sub- |
change. To help with this difficulty, PCRE supports the naming of sub- |
| 3957 |
patterns. This feature was not added to Perl until release 5.10. Python |
patterns. This feature was not added to Perl until release 5.10. Python |
| 3958 |
had the feature earlier, and PCRE introduced it at release 4.0, using |
had the feature earlier, and PCRE introduced it at release 4.0, using |
| 3959 |
the Python syntax. PCRE now supports both the Perl and the Python syn- |
the Python syntax. PCRE now supports both the Perl and the Python syn- |
| 3960 |
tax. |
tax. |
| 3961 |
|
|
| 3962 |
In PCRE, a subpattern can be named in one of three ways: (?<name>...) |
In PCRE, a subpattern can be named in one of three ways: (?<name>...) |
| 3963 |
or (?'name'...) as in Perl, or (?P<name>...) as in Python. References |
or (?'name'...) as in Perl, or (?P<name>...) as in Python. References |
| 3964 |
to capturing parentheses from other parts of the pattern, such as back- |
to capturing parentheses from other parts of the pattern, such as back- |
| 3965 |
references, recursion, and conditions, can be made by name as well as |
references, recursion, and conditions, can be made by name as well as |
| 3966 |
by number. |
by number. |
| 3967 |
|
|
| 3968 |
Names consist of up to 32 alphanumeric characters and underscores. |
Names consist of up to 32 alphanumeric characters and underscores. |
| 3969 |
Named capturing parentheses are still allocated numbers as well as |
Named capturing parentheses are still allocated numbers as well as |
| 3970 |
names, exactly as if the names were not present. The PCRE API provides |
names, exactly as if the names were not present. The PCRE API provides |
| 3971 |
function calls for extracting the name-to-number translation table from |
function calls for extracting the name-to-number translation table from |
| 3972 |
a compiled pattern. There is also a convenience function for extracting |
a compiled pattern. There is also a convenience function for extracting |
| 3973 |
a captured substring by name. |
a captured substring by name. |
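
The relationship between names and numbers can be illustrated with a minimal sketch. It uses Python's standard re module rather than the PCRE C API, relying only on the (?P<name>...) syntax that both engines accept. The pattern, the group names, and the subject string are invented for illustration; Python's groupindex attribute plays roughly the role of PCRE's name-to-number table, though the interfaces differ.

```python
import re

# Python-style named groups; PCRE accepts this (?P<name>...) syntax too.
# The third group is deliberately left unnamed.
pattern = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})-(\d{2})")

# Named groups are still numbered exactly as if the names were absent.
name_to_number = dict(pattern.groupindex)

m = pattern.match("2009-03-08")
by_name = m.group("year")    # extract by name
by_number = m.group(1)       # the same substring, extracted by number
unnamed_day = m.group(3)     # the unnamed group is reachable by number only
```

In this sketch the named and numbered lookups return the same substring, and the unnamed third group keeps the number it would have had anyway.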
| 3974 |
|
|
| 3975 |
By default, a name must be unique within a pattern, but it is possible |
By default, a name must be unique within a pattern, but it is possible |
| 3976 |
to relax this constraint by setting the PCRE_DUPNAMES option at compile |
to relax this constraint by setting the PCRE_DUPNAMES option at compile |
| 3977 |
time. This can be useful for patterns where only one instance of the |
time. This can be useful for patterns where only one instance of the |
| 3978 |
named parentheses can match. Suppose you want to match the name of a |
named parentheses can match. Suppose you want to match the name of a |
| 3979 |
weekday, either as a 3-letter abbreviation or as the full name, and in |
weekday, either as a 3-letter abbreviation or as the full name, and in |
| 3980 |
both cases you want to extract the abbreviation. This pattern (ignoring |
both cases you want to extract the abbreviation. This pattern (ignoring |
| 3981 |
the line breaks) does the job: |
the line breaks) does the job: |
| 3982 |
|
|
| 3986 |
(?<DN>Thu)(?:rsday)?| |
(?<DN>Thu)(?:rsday)?| |
| 3987 |
(?<DN>Sat)(?:urday)? |
(?<DN>Sat)(?:urday)? |
| 3988 |
|
|
| 3989 |
There are five capturing substrings, but only one is ever set after a |
There are five capturing substrings, but only one is ever set after a |
| 3990 |
match. (An alternative way of solving this problem is to use a "branch |
match. (An alternative way of solving this problem is to use a "branch |
| 3991 |
reset" subpattern, as described in the previous section.) |
reset" subpattern, as described in the previous section.) |
| 3992 |
|
|
| 3993 |
The convenience function for extracting the data by name returns the |
The convenience function for extracting the data by name returns the |
| 3994 |
substring for the first (and in this example, the only) subpattern of |
substring for the first (and in this example, the only) subpattern of |
| 3995 |
that name that matched. This saves searching to find which numbered |
that name that matched. This saves searching to find which numbered |
| 3996 |
subpattern it was. If you make a reference to a non-unique named sub- |
subpattern it was. If you make a reference to a non-unique named sub- |
| 3997 |
pattern from elsewhere in the pattern, the one that corresponds to the |
pattern from elsewhere in the pattern, the one that corresponds to the |
| 3998 |
lowest number is used. For further details of the interfaces for han- |
lowest number is used. For further details of the interfaces for han- |
| 3999 |
dling named subpatterns, see the pcreapi documentation. |
dling named subpatterns, see the pcreapi documentation. |
| 4000 |
|
|
| 4001 |
|
Warning: You cannot use different names to distinguish between two sub- |
| 4002 |
|
patterns with the same number (see the previous section) because PCRE |
| 4003 |
|
uses only the numbers when matching. |
| 4004 |
|
|
| 4005 |
|
|
| 4006 |
REPETITION |
REPETITION |
| 4007 |
|
|
| 4008 |
Repetition is specified by quantifiers, which can follow any of the |
Repetition is specified by quantifiers, which can follow any of the |
| 4009 |
following items: |
following items: |
| 4010 |
|
|
| 4011 |
a literal data character |
a literal data character |
| 4018 |
a back reference (see next section) |
a back reference (see next section) |
| 4019 |
a parenthesized subpattern (unless it is an assertion) |
a parenthesized subpattern (unless it is an assertion) |
| 4020 |
|
|
| 4021 |
The general repetition quantifier specifies a minimum and maximum num- |
The general repetition quantifier specifies a minimum and maximum num- |
| 4022 |
ber of permitted matches, by giving the two numbers in curly brackets |
ber of permitted matches, by giving the two numbers in curly brackets |
| 4023 |
(braces), separated by a comma. The numbers must be less than 65536, |
(braces), separated by a comma. The numbers must be less than 65536, |
| 4024 |
and the first must be less than or equal to the second. For example: |
and the first must be less than or equal to the second. For example: |
| 4025 |
|
|
| 4026 |
z{2,4} |
z{2,4} |
| 4027 |
|
|
| 4028 |
matches "zz", "zzz", or "zzzz". A closing brace on its own is not a |
matches "zz", "zzz", or "zzzz". A closing brace on its own is not a |
| 4029 |
special character. If the second number is omitted, but the comma is |
special character. If the second number is omitted, but the comma is |
| 4030 |
present, there is no upper limit; if the second number and the comma |
present, there is no upper limit; if the second number and the comma |
| 4031 |
are both omitted, the quantifier specifies an exact number of required |
are both omitted, the quantifier specifies an exact number of required |
| 4032 |
matches. Thus |
matches. Thus |
| 4033 |
|
|
| 4034 |
[aeiou]{3,} |
[aeiou]{3,} |
| 4037 |
|
|
| 4038 |
\d{8} |
\d{8} |
| 4039 |
|
|
| 4040 |
matches exactly 8 digits. An opening curly bracket that appears in a |
matches exactly 8 digits. An opening curly bracket that appears in a |
| 4041 |
position where a quantifier is not allowed, or one that does not match |
position where a quantifier is not allowed, or one that does not match |
| 4042 |
the syntax of a quantifier, is taken as a literal character. For exam- |
the syntax of a quantifier, is taken as a literal character. For exam- |
| 4043 |
ple, {,6} is not a quantifier, but a literal string of four characters. |
ple, {,6} is not a quantifier, but a literal string of four characters. |
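
The quantifier forms above can be tried out with a short sketch. It uses Python's re module as a stand-in for PCRE, and the helper name full is hypothetical. The forms {n,m}, {n,} and {n} behave alike in both engines, but {,6} does not: Python treats it as a quantifier with a lower bound of zero, so that case is left out here.

```python
import re

def full(pattern, subject):
    """Return True if pattern matches the whole of subject."""
    return re.fullmatch(pattern, subject) is not None

# {2,4}: between two and four repetitions
bounded = [s for s in ("z", "zz", "zzz", "zzzz", "zzzzz") if full(r"z{2,4}", s)]

# {3,}: at least three repetitions, no upper limit
at_least_three = full(r"[aeiou]{3,}", "aeiouaeiou")
too_few = full(r"[aeiou]{3,}", "ae")

# {8}: exactly eight repetitions
exactly_eight = full(r"\d{8}", "20090308")
nine_digits = full(r"\d{8}", "200903081")
```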
| 4044 |
|
|
| 4045 |
In UTF-8 mode, quantifiers apply to UTF-8 characters rather than to |
In UTF-8 mode, quantifiers apply to UTF-8 characters rather than to |
| 4183 |
|
|
| 4184 |
(?>\d+)foo |
(?>\d+)foo |
| 4185 |
|
|
| 4186 |
This kind of parenthesis "locks up" the part of the pattern it con- |
This kind of parenthesis "locks up" the part of the pattern it con- |
| 4187 |
tains once it has matched, and a failure further into the pattern is |
tains once it has matched, and a failure further into the pattern is |
| 4188 |
prevented from backtracking into it. Backtracking past it to previous |
prevented from backtracking into it. Backtracking past it to previous |
| 4189 |
items, however, works as normal. |
items, however, works as normal. |
| 4190 |
|
|
| 4191 |
An alternative description is that a subpattern of this type matches |
An alternative description is that a subpattern of this type matches |
| 4192 |
the string of characters that an identical standalone pattern would |
the string of characters that an identical standalone pattern would |
| 4193 |
match, if anchored at the current point in the subject string. |
match, if anchored at the current point in the subject string. |
| 4194 |
|
|
| 4195 |
Atomic grouping subpatterns are not capturing subpatterns. Simple cases |
Atomic grouping subpatterns are not capturing subpatterns. Simple cases |
| 4196 |
such as the above example can be thought of as a maximizing repeat that |
such as the above example can be thought of as a maximizing repeat that |
| 4197 |
must swallow everything it can. So, while both \d+ and \d+? are pre- |
must swallow everything it can. So, while both \d+ and \d+? are pre- |
| 4198 |
pared to adjust the number of digits they match in order to make the |
pared to adjust the number of digits they match in order to make the |
| 4199 |
rest of the pattern match, (?>\d+) can only match an entire sequence of |
rest of the pattern match, (?>\d+) can only match an entire sequence of |
| 4200 |
digits. |
digits. |
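
The difference between the three behaviours can be shown with a small sketch, again in Python's re module rather than PCRE. Because (?>...) is only available in recent Python versions, the atomic group is emulated here with the lookahead-plus-backreference idiom (?=(\d+))\1. This is a substitute technique, not PCRE's own syntax, and the subject string is invented for illustration.

```python
import re

subject = "123foo"

# Greedy: \d+ first takes "123", then gives back "3" so that \d can match.
greedy = re.fullmatch(r"\d+\dfoo", subject)

# Lazy: \d+? starts with "1" and extends to "12" so that \d can match "3".
lazy = re.fullmatch(r"\d+?\dfoo", subject)

# Emulated atomic group: the lookahead captures the whole digit run and
# \1 consumes it. Nothing can be given back, so the following \d fails.
atomic_like = re.fullmatch(r"(?=(\d+))\1\dfoo", subject)

# Without the extra \d, the emulated atomic form matches normally.
atomic_plain = re.fullmatch(r"(?=(\d+))\1foo", subject)
```

In this sketch the greedy and lazy forms both match, while the emulated atomic form matches only when nothing after it needs one of the digits.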
| 4201 |
|
|
| 4202 |
Atomic groups in general can of course contain arbitrarily complicated |
Atomic groups in general can of course contain arbitrarily complicated |
| 4203 |
subpatterns, and can be nested. However, when the subpattern for an |
subpatterns, and can be nested. However, when the subpattern for an |
| 4204 |
atomic group is just a single repeated item, as in the example above, a |
atomic group is just a single repeated item, as in the example above, a |
| 4205 |
simpler notation, called a "possessive quantifier" can be used. This |
simpler notation, called a "possessive quantifier" can be used. This |
| 4206 |
consists of an additional + character following a quantifier. Using |
consists of an additional + character following a quantifier. Using |
| 4207 |
this notation, the previous example can be rewritten as |
this notation, the previous example can be rewritten as |
| 4208 |
|
|
| 4209 |
\d++foo |
\d++foo |
| 4213 |
|
|
| 4214 |
(abc|xyz){2,3}+ |
(abc|xyz){2,3}+ |
| 4215 |
|
|
| 4216 |
Possessive quantifiers are always greedy; the setting of the |
Possessive quantifiers are always greedy; the setting of the |
| 4217 |
PCRE_UNGREEDY option is ignored. They are a convenient notation for the |
PCRE_UNGREEDY option is ignored. They are a convenient notation for the |
| 4218 |
simpler forms of atomic group. However, there is no difference in the |
simpler forms of atomic group. However, there is no difference in the |
| 4219 |
meaning of a possessive quantifier and the equivalent atomic group, |
meaning of a possessive quantifier and the equivalent atomic group, |
| 4220 |
though there may be a performance difference; possessive quantifiers |
though there may be a performance difference; possessive quantifiers |
| 4221 |
should be slightly faster. |
should be slightly faster. |
| 4222 |
|
|
| 4223 |
The possessive quantifier syntax is an extension to the Perl 5.8 syn- |
The possessive quantifier syntax is an extension to the Perl 5.8 syn- |
| 4224 |
tax. Jeffrey Friedl originated the idea (and the name) in the first |
tax. Jeffrey Friedl originated the idea (and the name) in the first |
| 4225 |
edition of his book. Mike McCloskey liked it, so implemented it when he |
edition of his book. Mike McCloskey liked it, so implemented it when he |
| 4226 |
built Sun's Java package, and PCRE copied it from there. It ultimately |
built Sun's Java package, and PCRE copied it from there. It ultimately |
| 4227 |
found its way into Perl at release 5.10. |
found its way into Perl at release 5.10. |
| 4228 |
|
|
| 4229 |
PCRE has an optimization that automatically "possessifies" certain sim- |
PCRE has an optimization that automatically "possessifies" certain sim- |
| 4230 |
ple pattern constructs. For example, the sequence A+B is treated as |
ple pattern constructs. For example, the sequence A+B is treated as |
| 4231 |
A++B because there is no point in backtracking into a sequence of A's |
A++B because there is no point in backtracking into a sequence of A's |
| 4232 |
when B must follow. |
when B must follow. |
| 4233 |
|
|
| 4234 |
When a pattern contains an unlimited repeat inside a subpattern that |
When a pattern contains an unlimited repeat inside a subpattern that |
| 4235 |
can itself be repeated an unlimited number of times, the use of an |
can itself be repeated an unlimited number of times, the use of an |
| 4236 |
atomic group is the only way to avoid some failing matches taking a |
atomic group is the only way to avoid some failing matches taking a |
| 4237 |
very long time indeed. The pattern |
very long time indeed. The pattern |
| 4238 |
|
|
| 4239 |
(\D+|<\d+>)*[!?] |
(\D+|<\d+>)*[!?] |
| 4240 |
|
|
| 4241 |
matches an unlimited number of substrings that either consist of non- |
matches an unlimited number of substrings that either consist of non- |
| 4242 |
digits, or digits enclosed in <>, followed by either ! or ?. When it |
digits, or digits enclosed in <>, followed by either ! or ?. When it |
| 4243 |
matches, it runs quickly. However, if it is applied to |
matches, it runs quickly. However, if it is applied to |
| 4244 |
|
|
| 4245 |
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa |
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa |
| 4246 |
|
|
| 4247 |
it takes a long time before reporting failure. This is because the |
it takes a long time before reporting failure. This is because the |
| 4248 |
string can be divided between the internal \D+ repeat and the external |
string can be divided between the internal \D+ repeat and the external |
| 4249 |
* repeat in a large number of ways, and all have to be tried. (The |
* repeat in a large number of ways, and all have to be tried. (The |
| 4250 |
example uses [!?] rather than a single character at the end, because |
example uses [!?] rather than a single character at the end, because |
| 4251 |
both PCRE and Perl have an optimization that allows for fast failure |
both PCRE and Perl have an optimization that allows for fast failure |
| 4252 |
when a single character is used. They remember the last single charac- |
when a single character is used. They remember the last single charac- |
| 4253 |
ter that is required for a match, and fail early if it is not present |
ter that is required for a match, and fail early if it is not present |
| 4254 |
in the string.) If the pattern is changed so that it uses an atomic |
in the string.) If the pattern is changed so that it uses an atomic |
| 4255 |
group, like this: |
group, like this: |
| 4256 |
|
|
| 4257 |
((?>\D+)|<\d+>)*[!?] |
((?>\D+)|<\d+>)*[!?] |
| 4258 |
|
|
| 4259 |
sequences of non-digits cannot be broken, and failure happens quickly. |
sequences of non-digits cannot be broken, and failure happens quickly. |
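
The effect can be observed with a rough sketch in Python's re module, which is a different engine from PCRE but backtracks in a similar way for this pattern. The atomic group is emulated with a lookahead and a backreference, (?=(\D+))\2, as an assumption-laden stand-in for (?>\D+). The helper timed_match is hypothetical, and the nested pattern is deliberately run on a short subject only, because its running time grows very quickly with the length.

```python
import re
import time

NESTED = re.compile(r"(\D+|<\d+>)*[!?]")
# Emulated atomic group: group 2 is a helper capture, not part of the original.
ATOMIC_LIKE = re.compile(r"((?=(\D+))\2|<\d+>)*[!?]")

def timed_match(pattern, subject):
    """Return (match object or None, elapsed seconds) for one anchored attempt."""
    start = time.perf_counter()
    result = pattern.match(subject)
    return result, time.perf_counter() - start

# Failing subjects: kept short for the nested form, full length for the other.
nested_result, nested_seconds = timed_match(NESTED, "a" * 16)
atomic_result, atomic_seconds = timed_match(ATOMIC_LIKE, "a" * 52)

# A subject that both forms match in full.
both = [p.match("<123>!").group(0) for p in (NESTED, ATOMIC_LIKE)]

# The forms are not interchangeable: the atomic one cannot give back the "!".
nested_bang = NESTED.match("abc!")
atomic_bang = ATOMIC_LIKE.match("abc!")
```

The timings depend on the machine and are not asserted here. The last two lines also show that the atomic form can reject an anchored match that the original pattern accepts, so it is worth checking matching subjects as well as failing ones before making the substitution.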
| 4260 |
|
|
| 4261 |
|
|
| 4262 |
BACK REFERENCES |
BACK REFERENCES |
| 5019 |
|
|
| 5020 |
REVISION |
REVISION |
| 5021 |
|
|
| 5022 |
Last updated: 19 April 2008 |
Last updated: 08 March 2009 |
| 5023 |
Copyright (c) 1997-2008 University of Cambridge. |
Copyright (c) 1997-2009 University of Cambridge. |
| 5024 |
------------------------------------------------------------------------------ |
------------------------------------------------------------------------------ |
| 5025 |
|
|
| 5026 |
|
|
| 5548 |
0: dogsbody |
0: dogsbody |
| 5549 |
1: dog |
1: dog |
| 5550 |
|
|
| 5551 |
The pattern matches the words "dog" or "dogsbody". When the subject is |
The pattern matches the words "dog" or "dogsbody". When the subject is |
| 5552 |
presented in several parts ("do" and "gsb" being the first two) the |
presented in several parts ("do" and "gsb" being the first two) the |
| 5553 |
match stops when "dog" has been found, and it is not possible to con- |
match stops when "dog" has been found, and it is not possible to con- |
| 5554 |
tinue. On the other hand, if "dogsbody" is presented as a single |
tinue. On the other hand, if "dogsbody" is presented as a single |
| 5555 |
string, both matches are found. |
string, both matches are found. |
| 5556 |
|
|
| 5557 |
Because of this phenomenon, it does not usually make sense to end a |
Because of this phenomenon, it does not usually make sense to end a |
| 5558 |
pattern that is going to be matched in this way with a variable repeat. |
pattern that is going to be matched in this way with a variable repeat. |
| 5559 |
|
|
| 5560 |
4. Patterns that contain alternatives at the top level which do not all |
4. Patterns that contain alternatives at the top level which do not all |
| 5901 |
command for linking an application that uses them. Because the POSIX |
command for linking an application that uses them. Because the POSIX |
| 5902 |
functions call the native ones, it is also necessary to add -lpcre. |
functions call the native ones, it is also necessary to add -lpcre. |
| 5903 |
|
|
| 5904 |
I have implemented only those option bits that can be reasonably mapped |
I have implemented only those POSIX option bits that can be reasonably |
| 5905 |
to PCRE native options. In addition, the option REG_EXTENDED is defined |
mapped to PCRE native options. In addition, the option REG_EXTENDED is |
| 5906 |
with the value zero. This has no effect, but since programs that are |
defined with the value zero. This has no effect, but since programs |
| 5907 |
written to the POSIX interface often use it, this makes it easier to |
that are written to the POSIX interface often use it, this makes it |
| 5908 |
slot in PCRE as a replacement library. Other POSIX options are not even |
easier to slot in PCRE as a replacement library. Other POSIX options |
| 5909 |
defined. |
are not even defined. |
| 5910 |
|
|
| 5911 |
When PCRE is called via these functions, it is only the API that is |
When PCRE is called via these functions, it is only the API that is |
| 5912 |
POSIX-like in style. The syntax and semantics of the regular expres- |
POSIX-like in style. The syntax and semantics of the regular expres- |
| 5986 |
MATCHING NEWLINE CHARACTERS |
MATCHING NEWLINE CHARACTERS |
| 5987 |
|
|
| 5988 |
This area is not simple, because POSIX and Perl take different views of |
This area is not simple, because POSIX and Perl take different views of |
| 5989 |
things. It is not possible to get PCRE to obey POSIX semantics, but |
things. It is not possible to get PCRE to obey POSIX semantics, but |
| 5990 |
then PCRE was never intended to be a POSIX engine. The following table |
then PCRE was never intended to be a POSIX engine. The following table |
| 5991 |
lists the different possibilities for matching newline characters in |
lists the different possibilities for matching newline characters in |
| 5992 |
PCRE: |
PCRE: |
| 5993 |
|
|
| 5994 |
Default Change with |
Default Change with |
| 6010 |
^ matches \n in middle no REG_NEWLINE |
^ matches \n in middle no REG_NEWLINE |
| 6011 |
|
|
| 6012 |
PCRE's behaviour is the same as Perl's, except that there is no equiva- |
PCRE's behaviour is the same as Perl's, except that there is no equiva- |
| 6013 |
lent for PCRE_DOLLAR_ENDONLY in Perl. In both PCRE and Perl, there is |
lent for PCRE_DOLLAR_ENDONLY in Perl. In both PCRE and Perl, there is |
| 6014 |
no way to stop newline from matching [^a]. |
no way to stop newline from matching [^a]. |
| 6015 |
|
|
| 6016 |
The default POSIX newline handling can be obtained by setting |
The default POSIX newline handling can be obtained by setting |
| 6017 |
PCRE_DOTALL and PCRE_DOLLAR_ENDONLY, but there is no way to make PCRE |
PCRE_DOTALL and PCRE_DOLLAR_ENDONLY, but there is no way to make PCRE |
| 6018 |
behave exactly as for the REG_NEWLINE action. |
behave exactly as for the REG_NEWLINE action. |
| 6019 |
|
|
| 6020 |
|
|
| 6021 |
MATCHING A PATTERN |
MATCHING A PATTERN |
| 6022 |
|
|
| 6023 |
The function regexec() is called to match a compiled pattern preg |
The function regexec() is called to match a compiled pattern preg |
| 6024 |
against a given string, which is by default terminated by a zero byte |
against a given string, which is by default terminated by a zero byte |
| 6025 |
(but see REG_STARTEND below), subject to the options in eflags. These |
(but see REG_STARTEND below), subject to the options in eflags. These |
| 6026 |
can be: |
can be: |
| 6027 |
|
|
| 6028 |
REG_NOTBOL |
REG_NOTBOL |
| 6030 |
The PCRE_NOTBOL option is set when calling the underlying PCRE matching |
The PCRE_NOTBOL option is set when calling the underlying PCRE matching |
| 6031 |
function. |
function. |
| 6032 |
|
|
| 6033 |
|
REG_NOTEMPTY |
| 6034 |
|
|
| 6035 |
|
The PCRE_NOTEMPTY option is set when calling the underlying PCRE match- |
| 6036 |
|
ing function. Note that REG_NOTEMPTY is not part of the POSIX standard. |
| 6037 |
|
However, setting this option can give more POSIX-like behaviour in some |
| 6038 |
|
situations. |
| 6039 |
|
|
| 6040 |
REG_NOTEOL |
REG_NOTEOL |
| 6041 |
|
|
| 6042 |
The PCRE_NOTEOL option is set when calling the underlying PCRE matching |
The PCRE_NOTEOL option is set when calling the underlying PCRE matching |
| 6099 |
|
|
| 6100 |
REVISION |
REVISION |
| 6101 |
|
|
| 6102 |
Last updated: 05 April 2008 |
Last updated: 11 March 2009 |
| 6103 |
Copyright (c) 1997-2008 University of Cambridge. |
Copyright (c) 1997-2009 University of Cambridge. |
| 6104 |
------------------------------------------------------------------------------ |
------------------------------------------------------------------------------ |
| 6105 |
|
|
| 6106 |
|
|
| 6204 |
need more, consider using the more general interface |
need more, consider using the more general interface |
| 6205 |
pcrecpp::RE::DoMatch. See pcrecpp.h for the signature for DoMatch. |
pcrecpp::RE::DoMatch. See pcrecpp.h for the signature for DoMatch. |
| 6206 |
|
|
| 6207 |
|
NOTE: Do not use no_arg, which is used internally to mark the end of a |
| 6208 |
|
list of optional arguments, as a placeholder for missing arguments, as |
| 6209 |
|
this can lead to segfaults. |
| 6210 |
|
|
| 6211 |
|
|
| 6212 |
QUOTING METACHARACTERS |
QUOTING METACHARACTERS |
| 6213 |
|
|
| 6441 |
|
|
| 6442 |
REVISION |
REVISION |
| 6443 |
|
|
| 6444 |
Last updated: 12 November 2007 |
Last updated: 17 March 2009 |
| 6445 |
------------------------------------------------------------------------------ |
------------------------------------------------------------------------------ |
| 6446 |
|
|
| 6447 |
|
|