| 18 |
|
|
| 19 |
The PCRE library is a set of functions that implement regular expres- |
The PCRE library is a set of functions that implement regular expres- |
| 20 |
sion pattern matching using the same syntax and semantics as Perl, with |
sion pattern matching using the same syntax and semantics as Perl, with |
| 21 |
just a few differences. The current implementation of PCRE (release |
just a few differences. (Certain features that appeared in Python and |
| 22 |
6.x) corresponds approximately with Perl 5.8, including support for |
PCRE before they appeared in Perl are also available using the Python |
| 23 |
UTF-8 encoded strings and Unicode general category properties. However, |
syntax.) |
| 24 |
this support has to be explicitly enabled; it is not the default. |
|
| 25 |
|
The current implementation of PCRE (release 7.x) corresponds approxi- |
| 26 |
In addition to the Perl-compatible matching function, PCRE also con- |
mately with Perl 5.10, including support for UTF-8 encoded strings and |
| 27 |
tains an alternative matching function that matches the same compiled |
Unicode general category properties. However, UTF-8 and Unicode support |
| 28 |
patterns in a different way. In certain circumstances, the alternative |
has to be explicitly enabled; it is not the default. The Unicode tables |
| 29 |
function has some advantages. For a discussion of the two matching |
correspond to Unicode release 5.0.0. |
| 30 |
algorithms, see the pcrematching page. |
|
| 31 |
|
In addition to the Perl-compatible matching function, PCRE contains an |
| 32 |
PCRE is written in C and released as a C library. A number of people |
alternative matching function that matches the same compiled patterns |
| 33 |
have written wrappers and interfaces of various kinds. In particular, |
in a different way. In certain circumstances, the alternative function |
| 34 |
Google Inc. have provided a comprehensive C++ wrapper. This is now |
has some advantages. For a discussion of the two matching algorithms, |
| 35 |
|
see the pcrematching page. |
| 36 |
|
|
| 37 |
|
PCRE is written in C and released as a C library. A number of people |
| 38 |
|
have written wrappers and interfaces of various kinds. In particular, |
| 39 |
|
Google Inc. have provided a comprehensive C++ wrapper. This is now |
| 40 |
included as part of the PCRE distribution. The pcrecpp page has details |
included as part of the PCRE distribution. The pcrecpp page has details |
| 41 |
of this interface. Other people's contributions can be found in the |
of this interface. Other people's contributions can be found in the |
| 42 |
Contrib directory at the primary FTP site, which is: |
Contrib directory at the primary FTP site, which is: |
| 43 |
|
|
| 44 |
ftp://ftp.csx.cam.ac.uk/pub/software/programming/pcre |
ftp://ftp.csx.cam.ac.uk/pub/software/programming/pcre |
| 45 |
|
|
| 46 |
Details of exactly which Perl regular expression features are and are |
Details of exactly which Perl regular expression features are and are |
| 47 |
not supported by PCRE are given in separate documents. See the pcrepat- |
not supported by PCRE are given in separate documents. See the pcrepat- |
| 48 |
tern and pcrecompat pages. |
tern and pcrecompat pages. |
| 49 |
|
|
| 50 |
Some features of PCRE can be included, excluded, or changed when the |
Some features of PCRE can be included, excluded, or changed when the |
| 51 |
library is built. The pcre_config() function makes it possible for a |
library is built. The pcre_config() function makes it possible for a |
| 52 |
client to discover which features are available. The features them- |
client to discover which features are available. The features them- |
| 53 |
selves are described in the pcrebuild page. Documentation about build- |
selves are described in the pcrebuild page. Documentation about build- |
| 54 |
ing PCRE for various operating systems can be found in the README file |
ing PCRE for various operating systems can be found in the README file |
| 55 |
in the source distribution. |
in the source distribution. |
| 56 |
|
|
| 57 |
The library contains a number of undocumented internal functions and |
The library contains a number of undocumented internal functions and |
| 58 |
data tables that are used by more than one of the exported external |
data tables that are used by more than one of the exported external |
| 59 |
functions, but which are not intended for use by external callers. |
functions, but which are not intended for use by external callers. |
| 60 |
Their names all begin with "_pcre_", which hopefully will not provoke |
Their names all begin with "_pcre_", which hopefully will not provoke |
| 61 |
any name clashes. In some environments, it is possible to control which |
any name clashes. In some environments, it is possible to control which |
| 62 |
external symbols are exported when a shared library is built, and in |
external symbols are exported when a shared library is built, and in |
| 63 |
these cases the undocumented symbols are not exported. |
these cases the undocumented symbols are not exported. |
| 64 |
|
|
| 65 |
|
|
| 66 |
USER DOCUMENTATION |
USER DOCUMENTATION |
| 67 |
|
|
| 68 |
The user documentation for PCRE comprises a number of different sec- |
The user documentation for PCRE comprises a number of different sec- |
| 69 |
tions. In the "man" format, each of these is a separate "man page". In |
tions. In the "man" format, each of these is a separate "man page". In |
| 70 |
the HTML format, each is a separate page, linked from the index page. |
the HTML format, each is a separate page, linked from the index page. |
| 71 |
In the plain text format, all the sections are concatenated, for ease |
In the plain text format, all the sections are concatenated, for ease |
| 72 |
of searching. The sections are as follows: |
of searching. The sections are as follows: |
| 73 |
|
|
| 74 |
pcre this document |
pcre this document |
| 89 |
pcrestack discussion of stack usage |
pcrestack discussion of stack usage |
| 90 |
pcretest description of the pcretest testing command |
pcretest description of the pcretest testing command |
| 91 |
|
|
| 92 |
In addition, in the "man" and HTML formats, there is a short page for |
In addition, in the "man" and HTML formats, there is a short page for |
| 93 |
each C library function, listing its arguments and results. |
each C library function, listing its arguments and results. |
| 94 |
|
|
| 95 |
|
|
| 96 |
LIMITATIONS |
LIMITATIONS |
| 97 |
|
|
| 98 |
There are some size limitations in PCRE but it is hoped that they will |
There are some size limitations in PCRE but it is hoped that they will |
| 99 |
never in practice be relevant. |
never in practice be relevant. |
| 100 |
|
|
| 101 |
The maximum length of a compiled pattern is 65539 (sic) bytes if PCRE |
The maximum length of a compiled pattern is 65539 (sic) bytes if PCRE |
| 102 |
is compiled with the default internal linkage size of 2. If you want to |
is compiled with the default internal linkage size of 2. If you want to |
| 103 |
process regular expressions that are truly enormous, you can compile |
process regular expressions that are truly enormous, you can compile |
| 104 |
PCRE with an internal linkage size of 3 or 4 (see the README file in |
PCRE with an internal linkage size of 3 or 4 (see the README file in |
| 105 |
the source distribution and the pcrebuild documentation for details). |
the source distribution and the pcrebuild documentation for details). |
| 106 |
In these cases the limit is substantially larger. However, the speed |
In these cases the limit is substantially larger. However, the speed |
| 107 |
of execution will be slower. |
of execution is slower. |
| 108 |
|
|
| 109 |
All values in repeating quantifiers must be less than 65536. The maxi- |
All values in repeating quantifiers must be less than 65536. The maxi- |
| 110 |
mum compiled length of a subpattern with an explicit repeat count is |
mum compiled length of a subpattern with an explicit repeat count is |
| 111 |
30000 bytes. The maximum number of capturing subpatterns is 65535. |
30000 bytes. The maximum number of capturing subpatterns is 65535. |
| 112 |
|
|
| 113 |
There is no limit to the number of non-capturing subpatterns, but the |
There is no limit to the number of parenthesized subpatterns, but there |
| 114 |
maximum depth of nesting of all kinds of parenthesized subpattern, |
can be no more than 65535 capturing subpatterns. |
|
including capturing subpatterns, assertions, and other types of subpat- |
|
|
tern, is 200. |
|
| 115 |
|
|
| 116 |
The maximum length of name for a named subpattern is 32, and the maxi- |
The maximum length of name for a named subpattern is 32 characters, and |
| 117 |
mum number of named subpatterns is 10000. |
the maximum number of named subpatterns is 10000. |
| 118 |
|
|
| 119 |
The maximum length of a subject string is the largest positive number |
The maximum length of a subject string is the largest positive number |
| 120 |
that an integer variable can hold. However, when using the traditional |
that an integer variable can hold. However, when using the traditional |
| 121 |
matching function, PCRE uses recursion to handle subpatterns and indef- |
matching function, PCRE uses recursion to handle subpatterns and indef- |
| 122 |
inite repetition. This means that the available stack space may limit |
inite repetition. This means that the available stack space may limit |
| 123 |
the size of a subject string that can be processed by certain patterns. |
the size of a subject string that can be processed by certain patterns. |
| 124 |
For a discussion of stack issues, see the pcrestack documentation. |
For a discussion of stack issues, see the pcrestack documentation. |
| 125 |
|
|
| 126 |
|
|
| 127 |
UTF-8 AND UNICODE PROPERTY SUPPORT |
UTF-8 AND UNICODE PROPERTY SUPPORT |
| 128 |
|
|
| 129 |
From release 3.3, PCRE has had some support for character strings |
From release 3.3, PCRE has had some support for character strings |
| 130 |
encoded in the UTF-8 format. For release 4.0 this was greatly extended |
encoded in the UTF-8 format. For release 4.0 this was greatly extended |
| 131 |
to cover most common requirements, and in release 5.0 additional sup- |
to cover most common requirements, and in release 5.0 additional sup- |
| 132 |
port for Unicode general category properties was added. |
port for Unicode general category properties was added. |
| 133 |
|
|
| 134 |
In order to process UTF-8 strings, you must build PCRE to include UTF-8 |
In order to process UTF-8 strings, you must build PCRE to include UTF-8 |
| 135 |
support in the code, and, in addition, you must call pcre_compile() |
support in the code, and, in addition, you must call pcre_compile() |
| 136 |
with the PCRE_UTF8 option flag. When you do this, both the pattern and |
with the PCRE_UTF8 option flag. When you do this, both the pattern and |
| 137 |
any subject strings that are matched against it are treated as UTF-8 |
any subject strings that are matched against it are treated as UTF-8 |
| 138 |
strings instead of just strings of bytes. |
strings instead of just strings of bytes. |
| 139 |
|
|
| 140 |
If you compile PCRE with UTF-8 support, but do not use it at run time, |
If you compile PCRE with UTF-8 support, but do not use it at run time, |
| 141 |
the library will be a bit bigger, but the additional run time overhead |
the library will be a bit bigger, but the additional run time overhead |
| 142 |
is limited to testing the PCRE_UTF8 flag in several places, so should |
is limited to testing the PCRE_UTF8 flag occasionally, so should not be |
| 143 |
not be very large. |
very big. |
| 144 |
|
|
| 145 |
If PCRE is built with Unicode character property support (which implies |
If PCRE is built with Unicode character property support (which implies |
| 146 |
UTF-8 support), the escape sequences \p{..}, \P{..}, and \X are sup- |
UTF-8 support), the escape sequences \p{..}, \P{..}, and \X are sup- |
| 147 |
ported. The available properties that can be tested are limited to the |
ported. The available properties that can be tested are limited to the |
| 148 |
general category properties such as Lu for an upper case letter or Nd |
general category properties such as Lu for an upper case letter or Nd |
| 149 |
for a decimal number, the Unicode script names such as Arabic or Han, |
for a decimal number, the Unicode script names such as Arabic or Han, |
| 150 |
and the derived properties Any and L&. A full list is given in the |
and the derived properties Any and L&. A full list is given in the |
| 151 |
pcrepattern documentation. Only the short names for properties are sup- |
pcrepattern documentation. Only the short names for properties are sup- |
| 152 |
ported. For example, \p{L} matches a letter. Its Perl synonym, \p{Let- |
ported. For example, \p{L} matches a letter. Its Perl synonym, \p{Let- |
| 153 |
ter}, is not supported. Furthermore, in Perl, many properties may |
ter}, is not supported. Furthermore, in Perl, many properties may |
| 154 |
optionally be prefixed by "Is", for compatibility with Perl 5.6. PCRE |
optionally be prefixed by "Is", for compatibility with Perl 5.6. PCRE |
| 155 |
does not support this. |
does not support this. |
| 156 |
|
|
| 157 |
The following comments apply when PCRE is running in UTF-8 mode: |
The following comments apply when PCRE is running in UTF-8 mode: |
| 158 |
|
|
| 159 |
1. When you set the PCRE_UTF8 flag, the strings passed as patterns and |
1. When you set the PCRE_UTF8 flag, the strings passed as patterns and |
| 160 |
subjects are checked for validity on entry to the relevant functions. |
subjects are checked for validity on entry to the relevant functions. |
| 161 |
If an invalid UTF-8 string is passed, an error return is given. In some |
If an invalid UTF-8 string is passed, an error return is given. In some |
| 162 |
situations, you may already know that your strings are valid, and |
situations, you may already know that your strings are valid, and |
| 163 |
therefore want to skip these checks in order to improve performance. If |
therefore want to skip these checks in order to improve performance. If |
| 164 |
you set the PCRE_NO_UTF8_CHECK flag at compile time or at run time, |
you set the PCRE_NO_UTF8_CHECK flag at compile time or at run time, |
| 165 |
PCRE assumes that the pattern or subject it is given (respectively) |
PCRE assumes that the pattern or subject it is given (respectively) |
| 166 |
contains only valid UTF-8 codes. In this case, it does not diagnose an |
contains only valid UTF-8 codes. In this case, it does not diagnose an |
| 167 |
invalid UTF-8 string. If you pass an invalid UTF-8 string to PCRE when |
invalid UTF-8 string. If you pass an invalid UTF-8 string to PCRE when |
| 168 |
PCRE_NO_UTF8_CHECK is set, the results are undefined. Your program may |
PCRE_NO_UTF8_CHECK is set, the results are undefined. Your program may |
| 169 |
crash. |
crash. |
| 170 |
|
|
| 171 |
2. An unbraced hexadecimal escape sequence (such as \xb3) matches a |
2. An unbraced hexadecimal escape sequence (such as \xb3) matches a |
| 172 |
two-byte UTF-8 character if the value is greater than 127. |
two-byte UTF-8 character if the value is greater than 127. |
| 173 |
|
|
| 174 |
3. Octal numbers up to \777 are recognized, and match two-byte UTF-8 |
3. Octal numbers up to \777 are recognized, and match two-byte UTF-8 |
| 175 |
characters for values greater than \177. |
characters for values greater than \177. |
| 176 |
|
|
| 177 |
4. Repeat quantifiers apply to complete UTF-8 characters, not to indi- |
4. Repeat quantifiers apply to complete UTF-8 characters, not to indi- |
| 178 |
vidual bytes, for example: \x{100}{3}. |
vidual bytes, for example: \x{100}{3}. |
| 179 |
|
|
| 180 |
5. The dot metacharacter matches one UTF-8 character instead of a sin- |
5. The dot metacharacter matches one UTF-8 character instead of a sin- |
| 181 |
gle byte. |
gle byte. |
| 182 |
|
|
| 183 |
6. The escape sequence \C can be used to match a single byte in UTF-8 |
6. The escape sequence \C can be used to match a single byte in UTF-8 |
| 184 |
mode, but its use can lead to some strange effects. This facility is |
mode, but its use can lead to some strange effects. This facility is |
| 185 |
not available in the alternative matching function, pcre_dfa_exec(). |
not available in the alternative matching function, pcre_dfa_exec(). |
| 186 |
|
|
| 187 |
7. The character escapes \b, \B, \d, \D, \s, \S, \w, and \W correctly |
7. The character escapes \b, \B, \d, \D, \s, \S, \w, and \W correctly |
| 188 |
test characters of any code value, but the characters that PCRE recog- |
test characters of any code value, but the characters that PCRE recog- |
| 189 |
nizes as digits, spaces, or word characters remain the same set as |
nizes as digits, spaces, or word characters remain the same set as |
| 190 |
before, all with values less than 256. This remains true even when PCRE |
before, all with values less than 256. This remains true even when PCRE |
| 191 |
includes Unicode property support, because to do otherwise would slow |
includes Unicode property support, because to do otherwise would slow |
| 192 |
down PCRE in many common cases. If you really want to test for a wider |
down PCRE in many common cases. If you really want to test for a wider |
| 193 |
sense of, say, "digit", you must use Unicode property tests such as |
sense of, say, "digit", you must use Unicode property tests such as |
| 194 |
\p{Nd}. |
\p{Nd}. |
| 195 |
|
|
| 196 |
8. Similarly, characters that match the POSIX named character classes |
8. Similarly, characters that match the POSIX named character classes |
| 197 |
are all low-valued characters. |
are all low-valued characters. |
| 198 |
|
|
| 199 |
9. Case-insensitive matching applies only to characters whose values |
9. Case-insensitive matching applies only to characters whose values |
| 200 |
are less than 128, unless PCRE is built with Unicode property support. |
are less than 128, unless PCRE is built with Unicode property support. |
| 201 |
Even when Unicode property support is available, PCRE still uses its |
Even when Unicode property support is available, PCRE still uses its |
| 202 |
own character tables when checking the case of low-valued characters, |
own character tables when checking the case of low-valued characters, |
| 203 |
so as not to degrade performance. The Unicode property information is |
so as not to degrade performance. The Unicode property information is |
| 204 |
used only for characters with higher values. Even when Unicode property |
used only for characters with higher values. Even when Unicode property |
| 205 |
support is available, PCRE supports case-insensitive matching only when |
support is available, PCRE supports case-insensitive matching only when |
| 206 |
there is a one-to-one mapping between a letter's cases. There are a |
there is a one-to-one mapping between a letter's cases. There are a |
| 207 |
small number of many-to-one mappings in Unicode; these are not sup- |
small number of many-to-one mappings in Unicode; these are not sup- |
| 208 |
ported by PCRE. |
ported by PCRE. |
| 209 |
|
|
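Points 2 and 3 in the list above say that values greater than 127 (such as \xb3, or octal values up to \777) match two-byte UTF-8 characters. The following is a minimal sketch of why that is so, showing how code points below 0x800 are laid out in UTF-8. The helper name utf8_encode_small is hypothetical and is not part of PCRE; it only illustrates the byte layout.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not a PCRE function: encode a code point below
   0x800 as UTF-8. Returns the number of bytes written to out (1 or 2). */
size_t utf8_encode_small(unsigned int cp, unsigned char out[2])
{
    assert(cp < 0x800);          /* only the 1- and 2-byte forms are handled */

    if (cp < 0x80) {             /* 0..127: a single byte, identical to ASCII */
        out[0] = (unsigned char)cp;
        return 1;
    }

    /* 128..2047: the top 5 bits go in the lead byte, the low 6 bits in
       the continuation byte. */
    out[0] = (unsigned char)(0xC0 | (cp >> 6));     /* 110xxxxx */
    out[1] = (unsigned char)(0x80 | (cp & 0x3F));   /* 10xxxxxx */
    return 2;
}
```

With this layout, 0xb3 becomes the two bytes 0xC2 0xB3, and octal 777 (decimal 511) becomes 0xC7 0xBF, whereas octal 177 stays as the single byte 0x7F.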
| 210 |
|
|
| 212 |
|
|
| 213 |
Philip Hazel |
Philip Hazel |
| 214 |
University Computing Service, |
University Computing Service, |
| 215 |
Cambridge CB2 3QG, England. |
Cambridge CB2 3QH, England. |
| 216 |
|
|
| 217 |
Putting an actual email address here seems to have been a spam magnet, |
Putting an actual email address here seems to have been a spam magnet, |
| 218 |
so I've taken it away. If you want to email me, use my initial and sur- |
so I've taken it away. If you want to email me, use my initial and sur- |
| 219 |
name, separated by a dot, at the domain ucs.cam.ac.uk. |
name, separated by a dot, at the domain ucs.cam.ac.uk. |
| 220 |
|
|
| 221 |
Last updated: 05 June 2006 |
Last updated: 23 November 2006 |
| 222 |
Copyright (c) 1997-2006 University of Cambridge. |
Copyright (c) 1997-2006 University of Cambridge. |
| 223 |
------------------------------------------------------------------------------ |
------------------------------------------------------------------------------ |
| 224 |
|
|
| 308 |
|
|
| 309 |
--enable-newline-is-crlf |
--enable-newline-is-crlf |
| 310 |
|
|
| 311 |
to the configure command. Whatever line ending convention is selected |
to the configure command. There is a fourth option, specified by |
| 312 |
when PCRE is built can be overridden when the library functions are |
|
| 313 |
called. At build time it is conventional to use the standard for your |
--enable-newline-is-any |
| 314 |
operating system. |
|
| 315 |
|
which causes PCRE to recognize any Unicode newline sequence. |
| 316 |
|
|
| 317 |
|
Whatever line ending convention is selected when PCRE is built can be |
| 318 |
|
overridden when the library functions are called. At build time it is |
| 319 |
|
conventional to use the standard for your operating system. |
| 320 |
|
|
| 321 |
|
|
| 322 |
BUILDING SHARED AND STATIC LIBRARIES |
BUILDING SHARED AND STATIC LIBRARIES |
| 323 |
|
|
| 324 |
The PCRE building process uses libtool to build both shared and static |
The PCRE building process uses libtool to build both shared and static |
| 325 |
Unix libraries by default. You can suppress one of these by adding one |
Unix libraries by default. You can suppress one of these by adding one |
| 326 |
of |
of |
| 327 |
|
|
| 328 |
--disable-shared |
--disable-shared |
| 334 |
POSIX MALLOC USAGE |
POSIX MALLOC USAGE |
| 335 |
|
|
| 336 |
When PCRE is called through the POSIX interface (see the pcreposix doc- |
When PCRE is called through the POSIX interface (see the pcreposix doc- |
| 337 |
umentation), additional working storage is required for holding the |
umentation), additional working storage is required for holding the |
| 338 |
pointers to capturing substrings, because PCRE requires three integers |
pointers to capturing substrings, because PCRE requires three integers |
| 339 |
per substring, whereas the POSIX interface provides only two. If the |
per substring, whereas the POSIX interface provides only two. If the |
| 340 |
number of expected substrings is small, the wrapper function uses space |
number of expected substrings is small, the wrapper function uses space |
| 341 |
on the stack, because this is faster than using malloc() for each call. |
on the stack, because this is faster than using malloc() for each call. |
| 342 |
The default threshold above which the stack is no longer used is 10; it |
The default threshold above which the stack is no longer used is 10; it |
| 349 |
|
|
| 350 |
HANDLING VERY LARGE PATTERNS |
HANDLING VERY LARGE PATTERNS |
| 351 |
|
|
| 352 |
Within a compiled pattern, offset values are used to point from one |
Within a compiled pattern, offset values are used to point from one |
| 353 |
part to another (for example, from an opening parenthesis to an alter- |
part to another (for example, from an opening parenthesis to an alter- |
| 354 |
nation metacharacter). By default, two-byte values are used for these |
nation metacharacter). By default, two-byte values are used for these |
| 355 |
offsets, leading to a maximum size for a compiled pattern of around |
offsets, leading to a maximum size for a compiled pattern of around |
| 356 |
64K. This is sufficient to handle all but the most gigantic patterns. |
64K. This is sufficient to handle all but the most gigantic patterns. |
| 357 |
Nevertheless, some people do want to process enormous patterns, so it |
Nevertheless, some people do want to process enormous patterns, so it |
| 358 |
is possible to compile PCRE to use three-byte or four-byte offsets by |
is possible to compile PCRE to use three-byte or four-byte offsets by |
| 359 |
adding a setting such as |
adding a setting such as |
| 360 |
|
|
| 361 |
--with-link-size=3 |
--with-link-size=3 |
| 362 |
|
|
| 363 |
to the configure command. The value given must be 2, 3, or 4. Using |
to the configure command. The value given must be 2, 3, or 4. Using |
| 364 |
longer offsets slows down the operation of PCRE because it has to load |
longer offsets slows down the operation of PCRE because it has to load |
| 365 |
additional bytes when handling them. |
additional bytes when handling them. |
| 366 |
|
|
| 367 |
If you build PCRE with an increased link size, test 2 (and test 5 if |
If you build PCRE with an increased link size, test 2 (and test 5 if |
| 368 |
you are using UTF-8) will fail. Part of the output of these tests is a |
you are using UTF-8) will fail. Part of the output of these tests is a |
| 369 |
representation of the compiled pattern, and this changes with the link |
representation of the compiled pattern, and this changes with the link |
| 370 |
size. |
size. |
| 371 |
|
|
| 372 |
|
|
| 373 |
AVOIDING EXCESSIVE STACK USAGE |
AVOIDING EXCESSIVE STACK USAGE |
| 374 |
|
|
| 375 |
When matching with the pcre_exec() function, PCRE implements backtrack- |
When matching with the pcre_exec() function, PCRE implements backtrack- |
| 376 |
ing by making recursive calls to an internal function called match(). |
ing by making recursive calls to an internal function called match(). |
| 377 |
In environments where the size of the stack is limited, this can se- |
In environments where the size of the stack is limited, this can se- |
| 378 |
verely limit PCRE's operation. (The Unix environment does not usually |
verely limit PCRE's operation. (The Unix environment does not usually |
| 379 |
suffer from this problem, but it may sometimes be necessary to increase |
suffer from this problem, but it may sometimes be necessary to increase |
| 380 |
the maximum stack size. There is a discussion in the pcrestack docu- |
the maximum stack size. There is a discussion in the pcrestack docu- |
| 381 |
mentation.) An alternative approach to recursion that uses memory from |
mentation.) An alternative approach to recursion that uses memory from |
| 382 |
the heap to remember data, instead of using recursive function calls, |
the heap to remember data, instead of using recursive function calls, |
| 383 |
has been implemented to work round the problem of limited stack size. |
has been implemented to work round the problem of limited stack size. |
| 384 |
If you want to build a version of PCRE that works this way, add |
If you want to build a version of PCRE that works this way, add |
| 385 |
|
|
| 386 |
--disable-stack-for-recursion |
--disable-stack-for-recursion |
| 387 |
|
|
| 388 |
to the configure command. With this configuration, PCRE will use the |
to the configure command. With this configuration, PCRE will use the |
| 389 |
pcre_stack_malloc and pcre_stack_free variables to call memory manage- |
pcre_stack_malloc and pcre_stack_free variables to call memory manage- |
| 390 |
ment functions. Separate functions are provided because the usage is |
ment functions. Separate functions are provided because the usage is |
| 391 |
very predictable: the block sizes requested are always the same, and |
very predictable: the block sizes requested are always the same, and |
| 392 |
the blocks are always freed in reverse order. A calling program might |
the blocks are always freed in reverse order. A calling program might |
| 393 |
be able to implement optimized functions that perform better than the |
be able to implement optimized functions that perform better than the |
| 394 |
standard malloc() and free() functions. PCRE runs noticeably more |
standard malloc() and free() functions. PCRE runs noticeably more |
| 395 |
slowly when built in this way. This option affects only the pcre_exec() |
slowly when built in this way. This option affects only the pcre_exec() |
| 396 |
function; it is not relevant for the pcre_dfa_exec() function. |
function; it is not relevant for the pcre_dfa_exec() function. |
| 397 |
|
|
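The text above notes that the block sizes requested are always the same and that blocks are always freed in reverse order, so a calling program might supply optimized functions. The following is one possible sketch of such a pair, using a small static pool as a stack of blocks. The names lifo_malloc and lifo_free and the values of BLOCK_SIZE and POOL_BLOCKS are assumptions made for this illustration; they do not come from PCRE, and a real program would choose sizes to suit the requests it actually sees.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Assumed values for this sketch only. */
#define BLOCK_SIZE  512
#define POOL_BLOCKS 64

/* The extra union members are there only to force suitable alignment. */
typedef union {
    double d;
    void *p;
    long l;
    unsigned char bytes[BLOCK_SIZE];
} pool_block;

static pool_block pool[POOL_BLOCKS];
static size_t pool_used = 0;        /* number of pool blocks handed out */

/* Has the same shape as the function that pcre_stack_malloc points to. */
void *lifo_malloc(size_t size)
{
    if (size <= BLOCK_SIZE && pool_used < POOL_BLOCKS)
        return pool[pool_used++].bytes;   /* take the next pool block */
    return malloc(size);                  /* pool full or request too big */
}

/* Has the same shape as the function that pcre_stack_free points to.
   This relies on the reverse-order guarantee: a pool block that is not
   the most recently issued one must never be passed in. */
void lifo_free(void *block)
{
    assert(block != NULL);
    if (pool_used > 0 && block == (void *)pool[pool_used - 1].bytes)
        pool_used--;                      /* pop the top pool block */
    else
        free(block);                      /* it came from malloc() */
}
```

A program using a library built with --disable-stack-for-recursion could then set pcre_stack_malloc = lifo_malloc and pcre_stack_free = lifo_free before matching. Whether this is faster than the standard functions depends on the platform and would need measuring.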
| 398 |
|
|
| 399 |
LIMITING PCRE RESOURCE USAGE |
LIMITING PCRE RESOURCE USAGE |
| 400 |
|
|
| 401 |
Internally, PCRE has a function called match(), which it calls repeat- |
Internally, PCRE has a function called match(), which it calls repeat- |
| 402 |
edly (sometimes recursively) when matching a pattern with the |
edly (sometimes recursively) when matching a pattern with the |
| 403 |
pcre_exec() function. By controlling the maximum number of times this |
pcre_exec() function. By controlling the maximum number of times this |
| 404 |
function may be called during a single matching operation, a limit can |
function may be called during a single matching operation, a limit can |
| 405 |
be placed on the resources used by a single call to pcre_exec(). The |
be placed on the resources used by a single call to pcre_exec(). The |
| 406 |
limit can be changed at run time, as described in the pcreapi documen- |
limit can be changed at run time, as described in the pcreapi documen- |
| 407 |
tation. The default is 10 million, but this can be changed by adding a |
tation. The default is 10 million, but this can be changed by adding a |
| 408 |
setting such as |
setting such as |
| 409 |
|
|
| 410 |
--with-match-limit=500000 |
--with-match-limit=500000 |
| 411 |
|
|
| 412 |
to the configure command. This setting has no effect on the |
to the configure command. This setting has no effect on the |
| 413 |
pcre_dfa_exec() matching function. |
pcre_dfa_exec() matching function. |
| 414 |
|
|
| 415 |
In some environments it is desirable to limit the depth of recursive |
In some environments it is desirable to limit the depth of recursive |
| 416 |
calls of match() more strictly than the total number of calls, in order |
calls of match() more strictly than the total number of calls, in order |
| 417 |
to restrict the maximum amount of stack (or heap, if --disable-stack- |
to restrict the maximum amount of stack (or heap, if --disable-stack- |
| 418 |
for-recursion is specified) that is used. A second limit controls this; |
for-recursion is specified) that is used. A second limit controls this; |
| 419 |
it defaults to the value that is set for --with-match-limit, which |
it defaults to the value that is set for --with-match-limit, which |
| 420 |
imposes no additional constraints. However, you can set a lower limit |
imposes no additional constraints. However, you can set a lower limit |
| 421 |
by adding, for example, |
by adding, for example, |
| 422 |
|
|
| 423 |
--with-match-limit-recursion=10000 |
--with-match-limit-recursion=10000 |
| 424 |
|
|
| 425 |
to the configure command. This value can also be overridden at run |
to the configure command. This value can also be overridden at run |
| 426 |
time. |
time. |
| 427 |
|
|
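The two limits described above can be illustrated with a toy backtracking matcher. This is only a sketch of the idea of counting calls and tracking recursion depth; it is not PCRE's match() function, and all the names in it (toy_match, toy_limits, TOY_LIMIT, and so on) are invented for the example. The toy pattern syntax supports only literal characters, '.', and '*' after either of them.

```c
#include <assert.h>
#include <stddef.h>

/* Invented for this sketch; these are not PCRE data structures. */
typedef struct {
    unsigned long calls;        /* calls to toy_match() so far */
    unsigned long call_limit;   /* plays the role of the match limit */
    unsigned long depth_limit;  /* plays the role of the recursion limit */
} toy_limits;

enum { TOY_LIMIT = -1, TOY_NOMATCH = 0, TOY_MATCH = 1 };

/* Match pattern p against the start of subject s, backtracking by
   recursion. Returns TOY_LIMIT as soon as either limit is exceeded. */
int toy_match(const char *p, const char *s, toy_limits *lim,
              unsigned long depth)
{
    assert(lim != NULL);
    if (++lim->calls > lim->call_limit || depth > lim->depth_limit)
        return TOY_LIMIT;

    if (*p == '\0')
        return TOY_MATCH;               /* pattern used up: success */

    if (p[1] == '*') {
        /* Greedy repeat: take as many characters as possible, then give
           them back one at a time until the rest of the pattern matches. */
        const char *t = s;
        while (*t != '\0' && (*p == '.' || *p == *t))
            t++;
        for (;;) {
            int rc = toy_match(p + 2, t, lim, depth + 1);
            if (rc != TOY_NOMATCH)
                return rc;              /* a match, or a limit was hit */
            if (t == s)
                return TOY_NOMATCH;     /* nothing left to give back */
            t--;
        }
    }

    if (*s != '\0' && (*p == '.' || *p == *s))
        return toy_match(p + 1, s + 1, lim, depth + 1);
    return TOY_NOMATCH;
}
```

Matching the toy pattern "a*a*a*b" against a run of ten "a" characters with no "b" needs several hundred calls before it can report failure, so a call limit of 100 stops it early with TOY_LIMIT, while a depth limit of 2 stops it because the nesting reaches depth 3. As in the text above, a depth limit equal to the call limit imposes no additional constraint.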
| 428 |
|
|
| 429 |
USING EBCDIC CODE |
USING EBCDIC CODE |
| 430 |
|
|
| 431 |
PCRE assumes by default that it will run in an environment where the |
PCRE assumes by default that it will run in an environment where the |
| 432 |
character code is ASCII (or Unicode, which is a superset of ASCII). |
character code is ASCII (or Unicode, which is a superset of ASCII). |
| 433 |
PCRE can, however, be compiled to run in an EBCDIC environment by |
PCRE can, however, be compiled to run in an EBCDIC environment by |
| 434 |
adding |
adding |
| 435 |
|
|
| 436 |
--enable-ebcdic |
--enable-ebcdic |
| 437 |
|
|
| 438 |
to the configure command. |
to the configure command. |
| 439 |
|
|
| 440 |
Last updated: 06 June 2006 |
|
| 441 |
|
SEE ALSO |
| 442 |
|
|
| 443 |
|
pcreapi(3), pcre_config(3). |
| 444 |
|
|
| 445 |
|
Last updated: 30 November 2006 |
| 446 |
Copyright (c) 1997-2006 University of Cambridge. |
Copyright (c) 1997-2006 University of Cambridge. |
| 447 |
------------------------------------------------------------------------------ |
------------------------------------------------------------------------------ |
| 448 |
|
|
| 479 |
<something> <something else> <something further> |
<something> <something else> <something further> |
| 480 |
|
|
| 481 |
there are three possible answers. The standard algorithm finds only one |
there are three possible answers. The standard algorithm finds only one |
| 482 |
of them, whereas the DFA algorithm finds all three. |
of them, whereas the alternative algorithm finds all three. |
| 483 |
|
|
| 484 |
|
|
| 485 |
REGULAR EXPRESSIONS AS TREES |
REGULAR EXPRESSIONS AS TREES |
| 520 |
This provides support for capturing parentheses and back references. |
This provides support for capturing parentheses and back references. |
| 521 |
|
|
| 522 |
|
|
| 523 |
THE DFA MATCHING ALGORITHM |
THE ALTERNATIVE MATCHING ALGORITHM |
| 524 |
|
|
| 525 |
DFA stands for "deterministic finite automaton", but you do not need to |
This algorithm conducts a breadth-first search of the tree. Starting |
| 526 |
understand the origins of that name. This algorithm conducts a breadth- |
from the first matching point in the subject, it scans the subject |
| 527 |
first search of the tree. Starting from the first matching point in the |
string from left to right, once, character by character, and as it does |
| 528 |
subject, it scans the subject string from left to right, once, charac- |
this, it remembers all the paths through the tree that represent valid |
| 529 |
ter by character, and as it does this, it remembers all the paths |
matches. In Friedl's terminology, this is a kind of "DFA algorithm", |
| 530 |
through the tree that represent valid matches. |
though it is not implemented as a traditional finite state machine (it |
| 531 |
|
keeps multiple states active simultaneously). |
| 532 |
The scan continues until either the end of the subject is reached, or |
|
| 533 |
there are no more unterminated paths. At this point, terminated paths |
The scan continues until either the end of the subject is reached, or |
| 534 |
represent the different matching possibilities (if there are none, the |
there are no more unterminated paths. At this point, terminated paths |
| 535 |
match has failed). Thus, if there is more than one possible match, |
represent the different matching possibilities (if there are none, the |
| 536 |
|
match has failed). Thus, if there is more than one possible match, |
| 537 |
this algorithm finds all of them, and in particular, it finds the long- |
this algorithm finds all of them, and in particular, it finds the long- |
| 538 |
est. In PCRE, there is an option to stop the algorithm after the first |
est. In PCRE, there is an option to stop the algorithm after the first |
| 539 |
match (which is necessarily the shortest) has been found. |
match (which is necessarily the shortest) has been found. |
| 540 |
|
|
| 541 |
Note that all the matches that are found start at the same point in the |
Note that all the matches that are found start at the same point in the |
| 543 |
|
|
| 544 |
cat(er(pillar)?) |
cat(er(pillar)?) |
| 545 |
|
|
| 546 |
is matched against the string "the caterpillar catchment", the result |
is matched against the string "the caterpillar catchment", the result |
| 547 |
will be the three strings "cat", "cater", and "caterpillar" that start |
will be the three strings "cat", "cater", and "caterpillar" that start |
| 548 |
at the fifth character of the subject. The algorithm does not automat- |
at the fifth character of the subject. The algorithm does not automat- |
| 549 |
ically move on to find matches that start at later positions. |
ically move on to find matches that start at later positions. |
| 550 |
|
|
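The "all matches at one starting point" behaviour described above can be sketched without the library. This is only an illustration under stated assumptions: the helper all_matches_at_first_point() and the candidates table are hypothetical and not part of PCRE, and the three literals stand in for the strings that cat(er(pillar)?) can match. Listing the longest match first is a choice made by this sketch.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical table, not part of PCRE: the only strings that the pattern
   cat(er(pillar)?) can match, longest first. */
static const char *const candidates[] = { "caterpillar", "cater", "cat" };

/* Find the first subject offset at which any candidate matches, then record
   the length of every candidate that matches at that same offset.
   Returns the number of matches found, or 0 if there is none. */
int all_matches_at_first_point(const char *subject, size_t *start,
                               size_t lengths[3])
{
    size_t n = strlen(subject);
    for (size_t pos = 0; pos < n; pos++) {
        int count = 0;
        for (size_t i = 0; i < 3; i++) {
            size_t len = strlen(candidates[i]);
            if (len <= n - pos &&
                strncmp(subject + pos, candidates[i], len) == 0)
                lengths[count++] = len;
        }
        if (count > 0) {
            *start = pos;
            return count;   /* do not move on to later starting points */
        }
    }
    return 0;
}
```

For the subject "the caterpillar catchment" the sketch reports three matches at zero-based offset 4, with lengths 11, 5, and 3. The later "cat" in "catchment" is never examined.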
| 551 |
There are a number of features of PCRE regular expressions that are not |
There are a number of features of PCRE regular expressions that are not |
| 552 |
supported by the DFA matching algorithm. They are as follows: |
supported by the alternative matching algorithm. They are as follows: |
| 553 |
|
|
| 554 |
1. Because the algorithm finds all possible matches, the greedy or |
1. Because the algorithm finds all possible matches, the greedy or |
| 555 |
ungreedy nature of repetition quantifiers is not relevant. Greedy and |
ungreedy nature of repetition quantifiers is not relevant. Greedy and |
| 556 |
ungreedy quantifiers are treated in exactly the same way. |
ungreedy quantifiers are treated in exactly the same way. However, pos- |
| 557 |
|
sessive quantifiers can make a difference when what follows could also |
| 558 |
|
match what is quantified, for example in a pattern like this: |
| 559 |
|
|
| 560 |
|
^a++\w! |
| 561 |
|
|
| 562 |
|
This pattern matches "aaab!" but not "aaa!", which would be matched by |
| 563 |
|
a non-possessive quantifier. Similarly, if an atomic group is present, |
| 564 |
|
it is matched as if it were a standalone pattern at the current point, |
| 565 |
|
and the longest match is then "locked in" for the rest of the overall |
| 566 |
|
pattern. |
| 567 |
|
|
| 568 |
2. When dealing with multiple paths through the tree simultaneously, it |
2. When dealing with multiple paths through the tree simultaneously, it |
| 569 |
is not straightforward to keep track of captured substrings for the |
is not straightforward to keep track of captured substrings for the |
| 570 |
different matching possibilities, and PCRE's implementation of this |
different matching possibilities, and PCRE's implementation of this |
| 571 |
algorithm does not attempt to do this. This means that no captured sub- |
algorithm does not attempt to do this. This means that no captured sub- |
| 572 |
strings are available. |
strings are available. |
| 573 |
|
|
| 574 |
3. Because no substrings are captured, back references within the pat- |
3. Because no substrings are captured, back references within the pat- |
| 575 |
tern are not supported, and cause errors if encountered. |
tern are not supported, and cause errors if encountered. |
| 576 |
|
|
| 577 |
4. For the same reason, conditional expressions that use a backrefer- |
4. For the same reason, conditional expressions that use a backrefer- |
| 578 |
ence as the condition are not supported. |
ence as the condition or test for a specific group recursion are not |
| 579 |
|
supported. |
| 580 |
|
|
| 581 |
5. Callouts are supported, but the value of the capture_top field is |
5. Callouts are supported, but the value of the capture_top field is |
| 582 |
always 1, and the value of the capture_last field is always -1. |
always 1, and the value of the capture_last field is always -1. |
| 583 |
|
|
| 584 |
6. The \C escape sequence, which (in the standard algorithm) matches a |
6. The \C escape sequence, which (in the standard algorithm) matches a |
| 585 |
single byte, even in UTF-8 mode, is not supported because the DFA algo- |
single byte, even in UTF-8 mode, is not supported because the alterna- |
| 586 |
rithm moves through the subject string one character at a time, for all |
tive algorithm moves through the subject string one character at a |
| 587 |
active paths through the tree. |
time, for all active paths through the tree. |
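The possessive-quantifier case in item 1 of the list above can be checked with a hand-written matcher. This is a sketch, not PCRE code: match_possessive() and match_non_possessive() are hypothetical helpers hard-wired to ^a++\w! and ^a+\w! respectively, and \w is assumed to mean ASCII letters, digits, and underscore.

```c
#include <ctype.h>
#include <stddef.h>

/* Word character in the sense of \w, for plain ASCII input (assumption). */
static int is_word(char c)
{
    return isalnum((unsigned char)c) || c == '_';
}

/* Does "\w!" match exactly at s? */
static int rest_matches(const char *s)
{
    return s[0] != '\0' && is_word(s[0]) && s[1] == '!';
}

/* ^a++\w! : the run of 'a' characters is consumed once and never given back. */
int match_possessive(const char *s)
{
    size_t n = 0;
    while (s[n] == 'a')
        n++;
    return n >= 1 && rest_matches(s + n);
}

/* ^a+\w! : every run length from the longest down to 1 is a possibility. */
int match_non_possessive(const char *s)
{
    size_t n = 0;
    while (s[n] == 'a')
        n++;
    for (; n >= 1; n--)
        if (rest_matches(s + n))
            return 1;
    return 0;
}
```

With these helpers, "aaab!" matches both forms, whereas "aaa!" matches only the non-possessive form, because the final "a" has to be released so that \w can match it.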
| 588 |
|
|
| 589 |
|
|
| 590 |
ADVANTAGES OF THE DFA ALGORITHM |
ADVANTAGES OF THE ALTERNATIVE ALGORITHM |
| 591 |
|
|
| 592 |
Using the DFA matching algorithm provides the following advantages: |
Using the alternative matching algorithm provides the following advan- |
| 593 |
|
tages: |
| 594 |
|
|
| 595 |
1. All possible matches (at a single point in the subject) are automat- |
1. All possible matches (at a single point in the subject) are automat- |
| 596 |
ically found, and in particular, the longest match is found. To find |
ically found, and in particular, the longest match is found. To find |
| 597 |
more than one match using the standard algorithm, you have to do kludgy |
more than one match using the standard algorithm, you have to do kludgy |
| 598 |
things with callouts. |
things with callouts. |
| 599 |
|
|
| 600 |
2. There is much better support for partial matching. The restrictions |
2. There is much better support for partial matching. The restrictions |
| 601 |
on the content of the pattern that apply when using the standard algo- |
on the content of the pattern that apply when using the standard algo- |
| 602 |
rithm for partial matching do not apply to the DFA algorithm. For non- |
rithm for partial matching do not apply to the alternative algorithm. |
| 603 |
anchored patterns, the starting position of a partial match is avail- |
For non-anchored patterns, the starting position of a partial match is |
| 604 |
able. |
available. |
| 605 |
|
|
| 606 |
3. Because the DFA algorithm scans the subject string just once, and |
3. Because the alternative algorithm scans the subject string just |
| 607 |
never needs to backtrack, it is possible to pass very long subject |
once, and never needs to backtrack, it is possible to pass very long |
| 608 |
strings to the matching function in several pieces, checking for par- |
subject strings to the matching function in several pieces, checking |
| 609 |
tial matching each time. |
for partial matching each time. |
| 610 |
|
|
| 611 |
|
|
| 612 |
DISADVANTAGES OF THE DFA ALGORITHM |
DISADVANTAGES OF THE ALTERNATIVE ALGORITHM |
| 613 |
|
|
| 614 |
The DFA algorithm suffers from a number of disadvantages: |
The alternative algorithm suffers from a number of disadvantages: |
| 615 |
|
|
| 616 |
1. It is substantially slower than the standard algorithm. This is |
1. It is substantially slower than the standard algorithm. This is |
| 617 |
partly because it has to search for all possible matches, but is also |
partly because it has to search for all possible matches, but is also |
| 618 |
because it is less susceptible to optimization. |
because it is less susceptible to optimization. |
| 619 |
|
|
| 620 |
2. Capturing parentheses and back references are not supported. |
2. Capturing parentheses and back references are not supported. |
| 621 |
|
|
| 622 |
3. The "atomic group" feature of PCRE regular expressions is supported, |
3. Although atomic groups are supported, their use does not provide the |
| 623 |
but does not provide the advantage that it does for the standard algo- |
performance advantage that it does for the standard algorithm. |
|
rithm. |
|
| 624 |
|
|
| 625 |
Last updated: 06 June 2006 |
Last updated: 24 November 2006 |
| 626 |
Copyright (c) 1997-2006 University of Cambridge. |
Copyright (c) 1997-2006 University of Cambridge. |
| 627 |
------------------------------------------------------------------------------ |
------------------------------------------------------------------------------ |
| 628 |
|
|
| 717 |
PCRE API OVERVIEW |
PCRE API OVERVIEW |
| 718 |
|
|
| 719 |
PCRE has its own native API, which is described in this document. There |
PCRE has its own native API, which is described in this document. There |
| 720 |
is also a set of wrapper functions that correspond to the POSIX regular |
are also some wrapper functions that correspond to the POSIX regular |
| 721 |
expression API. These are described in the pcreposix documentation. |
expression API. These are described in the pcreposix documentation. |
| 722 |
Both of these APIs define a set of C function calls. A C++ wrapper is |
Both of these APIs define a set of C function calls. A C++ wrapper is |
| 723 |
distributed with PCRE. It is documented in the pcrecpp page. |
distributed with PCRE. It is documented in the pcrecpp page. |
| 740 |
A second matching function, pcre_dfa_exec(), which is not Perl-compati- |
A second matching function, pcre_dfa_exec(), which is not Perl-compati- |
| 741 |
ble, is also provided. This uses a different algorithm for the match- |
ble, is also provided. This uses a different algorithm for the match- |
| 742 |
ing. The alternative algorithm finds all possible matches (at a given |
ing. The alternative algorithm finds all possible matches (at a given |
| 743 |
point in the subject). However, this algorithm does not return captured |
point in the subject), and scans the subject just once. However, this |
| 744 |
substrings. A description of the two matching algorithms and their |
algorithm does not return captured substrings. A description of the two |
| 745 |
advantages and disadvantages is given in the pcrematching documenta- |
matching algorithms and their advantages and disadvantages is given in |
| 746 |
tion. |
the pcrematching documentation. |
| 747 |
|
|
| 748 |
In addition to the main compiling and matching functions, there are |
In addition to the main compiling and matching functions, there are |
| 749 |
convenience functions for extracting captured substrings from a subject |
convenience functions for extracting captured substrings from a subject |
| 804 |
|
|
| 805 |
|
|
| 806 |
NEWLINES |
NEWLINES |
| 807 |
PCRE supports three different conventions for indicating line breaks in |
|
| 808 |
strings: a single CR character, a single LF character, or the two-char- |
PCRE supports four different conventions for indicating line breaks in |
| 809 |
acter sequence CRLF. All three are used as "standard" by different |
strings: a single CR (carriage return) character, a single LF (line- |
| 810 |
operating systems. When PCRE is built, a default can be specified. The |
feed) character, the two-character sequence CRLF, or any Unicode new- |
| 811 |
default default is LF, which is the Unix standard. When PCRE is run, |
line sequence. The Unicode newline sequences are the three just men- |
| 812 |
the default can be overridden, either when a pattern is compiled, or |
tioned, plus the single characters VT (vertical tab, U+000B), FF (form- |
| 813 |
when it is matched. |
feed, U+000C), NEL (next line, U+0085), LS (line separator, U+2028), |
| 814 |
|
and PS (paragraph separator, U+2029). |
| 815 |
|
|
| 816 |
|
Each of the first three conventions is used by at least one operating |
| 817 |
|
system as its standard newline sequence. When PCRE is built, a default |
| 818 |
|
can be specified. The default default is LF, which is the Unix stan- |
| 819 |
|
dard. When PCRE is run, the default can be overridden, either when a |
| 820 |
|
pattern is compiled, or when it is matched. |
| 821 |
|
|
| 822 |
In the PCRE documentation the word "newline" is used to mean "the char- |
In the PCRE documentation the word "newline" is used to mean "the char- |
| 823 |
acter or pair of characters that indicate a line break". |
acter or pair of characters that indicate a line break". The choice of |
| 824 |
|
newline convention affects the handling of the dot, circumflex, and |
| 825 |
|
dollar metacharacters, the handling of #-comments in /x mode, and, when |
| 826 |
|
CRLF is a recognized line ending sequence, the match position advance- |
| 827 |
|
ment for a non-anchored pattern. The choice of newline convention does |
| 828 |
|
not affect the interpretation of the \n or \r escape sequences. |
| 829 |
|
|
| 830 |
|
|
| 831 |
MULTITHREADING |
MULTITHREADING |
| 832 |
|
|
| 833 |
The PCRE functions can be used in multi-threading applications, with |
The PCRE functions can be used in multi-threading applications, with |
| 834 |
the proviso that the memory management functions pointed to by |
the proviso that the memory management functions pointed to by |
| 835 |
pcre_malloc, pcre_free, pcre_stack_malloc, and pcre_stack_free, and the |
pcre_malloc, pcre_free, pcre_stack_malloc, and pcre_stack_free, and the |
| 836 |
callout function pointed to by pcre_callout, are shared by all threads. |
callout function pointed to by pcre_callout, are shared by all threads. |
| 837 |
|
|
| 838 |
The compiled form of a regular expression is not altered during match- |
The compiled form of a regular expression is not altered during match- |
| 839 |
ing, so the same compiled pattern can safely be used by several threads |
ing, so the same compiled pattern can safely be used by several threads |
| 840 |
at once. |
at once. |
| 841 |
|
|
| 843 |
SAVING PRECOMPILED PATTERNS FOR LATER USE |
SAVING PRECOMPILED PATTERNS FOR LATER USE |
| 844 |
|
|
| 845 |
The compiled form of a regular expression can be saved and re-used at a |
The compiled form of a regular expression can be saved and re-used at a |
| 846 |
later time, possibly by a different program, and even on a host other |
later time, possibly by a different program, and even on a host other |
| 847 |
than the one on which it was compiled. Details are given in the |
than the one on which it was compiled. Details are given in the |
| 848 |
pcreprecompile documentation. |
pcreprecompile documentation. |
| 849 |
|
|
| 850 |
|
|
| 852 |
|
|
| 853 |
int pcre_config(int what, void *where); |
int pcre_config(int what, void *where); |
| 854 |
|
|
| 855 |
The function pcre_config() makes it possible for a PCRE client to dis- |
The function pcre_config() makes it possible for a PCRE client to dis- |
| 856 |
cover which optional features have been compiled into the PCRE library. |
cover which optional features have been compiled into the PCRE library. |
| 857 |
The pcrebuild documentation has more details about these optional fea- |
The pcrebuild documentation has more details about these optional fea- |
| 858 |
tures. |
tures. |
| 859 |
|
|
| 860 |
The first argument for pcre_config() is an integer, specifying which |
The first argument for pcre_config() is an integer, specifying which |
| 861 |
information is required; the second argument is a pointer to a variable |
information is required; the second argument is a pointer to a variable |
| 862 |
into which the information is placed. The following information is |
into which the information is placed. The following information is |
| 863 |
available: |
available: |
| 864 |
|
|
| 865 |
PCRE_CONFIG_UTF8 |
PCRE_CONFIG_UTF8 |
| 866 |
|
|
| 867 |
The output is an integer that is set to one if UTF-8 support is avail- |
The output is an integer that is set to one if UTF-8 support is avail- |
| 868 |
able; otherwise it is set to zero. |
able; otherwise it is set to zero. |
| 869 |
|
|
| 870 |
PCRE_CONFIG_UNICODE_PROPERTIES |
PCRE_CONFIG_UNICODE_PROPERTIES |
| 871 |
|
|
| 872 |
The output is an integer that is set to one if support for Unicode |
The output is an integer that is set to one if support for Unicode |
| 873 |
character properties is available; otherwise it is set to zero. |
character properties is available; otherwise it is set to zero. |
| 874 |
|
|
| 875 |
PCRE_CONFIG_NEWLINE |
PCRE_CONFIG_NEWLINE |
| 876 |
|
|
| 877 |
The output is an integer whose value specifies the default character |
The output is an integer whose value specifies the default character |
| 878 |
sequence that is recognized as meaning "newline". The three values that |
sequence that is recognized as meaning "newline". The four values that |
| 879 |
are supported are: 10 for LF, 13 for CR, and 3338 for CRLF. The default |
are supported are: 10 for LF, 13 for CR, 3338 for CRLF, and -1 for ANY. |
| 880 |
should normally be the standard sequence for your operating system. |
The default should normally be the standard sequence for your operating |
| 881 |
|
system. |
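The four values listed for PCRE_CONFIG_NEWLINE can be decoded with a small helper. The function newline_name() below is hypothetical and not part of the PCRE API. The sketch assumes that 3338 is simply the two character codes packed together, which the arithmetic supports (13 * 256 + 10 = 3338).

```c
#include <stddef.h>

/* Map a PCRE_CONFIG_NEWLINE result to a readable name.
   Returns NULL for a value that the text above does not list. */
const char *newline_name(int value)
{
    switch (value) {
    case 10:             return "LF";
    case 13:             return "CR";
    case (13 << 8) | 10: return "CRLF";   /* 3338 */
    case -1:             return "ANY";
    default:             return NULL;
    }
}
```

A caller would pass in the integer that pcre_config() stores for PCRE_CONFIG_NEWLINE and should treat a NULL result as a value from a newer or differently built library.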
| 882 |
|
|
| 883 |
PCRE_CONFIG_LINK_SIZE |
PCRE_CONFIG_LINK_SIZE |
| 884 |
|
|
| 947 |
fully relocatable, because it may contain a copy of the tableptr argu- |
fully relocatable, because it may contain a copy of the tableptr argu- |
| 948 |
ment, which is an address (see below). |
ment, which is an address (see below). |
| 949 |
|
|
| 950 |
The options argument contains independent bits that affect the compila- |
The options argument contains various bit settings that affect the com- |
| 951 |
tion. It should be zero if no options are required. The available |
pilation. It should be zero if no options are required. The available |
| 952 |
options are described below. Some of them, in particular, those that |
options are described below. Some of them, in particular, those that |
| 953 |
are compatible with Perl, can also be set and unset from within the |
are compatible with Perl, can also be set and unset from within the |
| 954 |
pattern (see the detailed description in the pcrepattern documenta- |
pattern (see the detailed description in the pcrepattern documenta- |
| 1038 |
not match when the current position is at a newline. This option is |
not match when the current position is at a newline. This option is |
| 1039 |
equivalent to Perl's /s option, and it can be changed within a pattern |
equivalent to Perl's /s option, and it can be changed within a pattern |
| 1040 |
by a (?s) option setting. A negative class such as [^a] always matches |
by a (?s) option setting. A negative class such as [^a] always matches |
| 1041 |
newlines, independent of the setting of this option. |
newline characters, independent of the setting of this option. |
| 1042 |
|
|
| 1043 |
PCRE_DUPNAMES |
PCRE_DUPNAMES |
| 1044 |
|
|
| 1102 |
PCRE_NEWLINE_CR |
PCRE_NEWLINE_CR |
| 1103 |
PCRE_NEWLINE_LF |
PCRE_NEWLINE_LF |
| 1104 |
PCRE_NEWLINE_CRLF |
PCRE_NEWLINE_CRLF |
| 1105 |
|
PCRE_NEWLINE_ANY |
| 1106 |
|
|
| 1107 |
These options override the default newline definition that was chosen |
These options override the default newline definition that was chosen |
| 1108 |
when PCRE was built. Setting the first or the second specifies that a |
when PCRE was built. Setting the first or the second specifies that a |
| 1109 |
newline is indicated by a single character (CR or LF, respectively). |
newline is indicated by a single character (CR or LF, respectively). |
| 1110 |
Setting both of them specifies that a newline is indicated by the two- |
Setting PCRE_NEWLINE_CRLF specifies that a newline is indicated by the |
| 1111 |
character CRLF sequence. For convenience, PCRE_NEWLINE_CRLF is defined |
two-character CRLF sequence. Setting PCRE_NEWLINE_ANY specifies that |
| 1112 |
to contain both bits. The only time that a line break is relevant when |
any Unicode newline sequence should be recognized. The Unicode newline |
| 1113 |
compiling a pattern is if PCRE_EXTENDED is set, and an unescaped # out- |
sequences are the three just mentioned, plus the single characters VT |
| 1114 |
side a character class is encountered. This indicates a comment that |
(vertical tab, U+000B), FF (formfeed, U+000C), NEL (next line, U+0085), |
| 1115 |
lasts until after the next newline. |
LS (line separator, U+2028), and PS (paragraph separator, U+2029). The |
| 1116 |
|
last two are recognized only in UTF-8 mode. |
| 1117 |
|
|
| 1118 |
|
The newline setting in the options word uses three bits that are |
| 1119 |
|
treated as a number, giving eight possibilities. Currently only five |
| 1120 |
|
are used (default plus the four values above). This means that if you |
| 1121 |
|
set more than one newline option, the combination may or may not be |
| 1122 |
|
sensible. For example, PCRE_NEWLINE_CR with PCRE_NEWLINE_LF is equiva- |
| 1123 |
|
lent to PCRE_NEWLINE_CRLF, but other combinations yield unused numbers |
| 1124 |
|
and cause an error. |
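The three-bit newline field described above can be illustrated as follows. The NL_* names are local to this sketch, and their values are assumed to mirror the PCRE_NEWLINE_* definitions in the pcre.h of this release; check your own header before relying on them. The helper newline_bits_valid() is hypothetical.

```c
/* Assumed bit values (see the note above); the names are local to this sketch. */
#define NL_CR    0x00100000
#define NL_LF    0x00200000
#define NL_CRLF  0x00300000
#define NL_ANY   0x00400000
#define NL_MASK  0x00700000   /* three bits, so eight possible numbers */

/* Returns 1 if the newline field of options is the default (0) or one of the
   four defined values, and 0 if it is one of the unused numbers. */
int newline_bits_valid(unsigned long options)
{
    switch (options & NL_MASK) {
    case 0:
    case NL_CR:
    case NL_LF:
    case NL_CRLF:
    case NL_ANY:
        return 1;
    default:
        return 0;
    }
}
```

Under these assumptions NL_CR | NL_LF equals NL_CRLF and is accepted, while NL_CR | NL_ANY yields an unused number and is rejected.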
| 1125 |
|
|
| 1126 |
|
The only time that a line break is specially recognized when compiling |
| 1127 |
|
a pattern is if PCRE_EXTENDED is set, and an unescaped # outside a |
| 1128 |
|
character class is encountered. This indicates a comment that lasts |
| 1129 |
|
until after the next line break sequence. In other circumstances, line |
| 1130 |
|
break sequences are treated as literal data, except that in |
| 1131 |
|
PCRE_EXTENDED mode, both CR and LF are treated as whitespace characters |
| 1132 |
|
and are therefore ignored. |
| 1133 |
|
|
| 1134 |
The newline option set at compile time becomes the default that is used |
The newline option that is set at compile time becomes the default that |
| 1135 |
for pcre_exec() and pcre_dfa_exec(), but it can be overridden. |
is used for pcre_exec() and pcre_dfa_exec(), but it can be overridden. |
| 1136 |
|
|
| 1137 |
PCRE_NO_AUTO_CAPTURE |
PCRE_NO_AUTO_CAPTURE |
| 1138 |
|
|
| 1175 |
|
|
| 1176 |
The following table lists the error codes that may be returned by |
The following table lists the error codes that may be returned by |
| 1177 |
pcre_compile2(), along with the error messages that may be returned by |
pcre_compile2(), along with the error messages that may be returned by |
| 1178 |
both compiling functions. |
both compiling functions. As PCRE has developed, some error codes have |
| 1179 |
|
fallen out of use. To avoid confusion, they have not been re-used. |
| 1180 |
|
|
| 1181 |
0 no error |
0 no error |
| 1182 |
1 \ at end of pattern |
1 \ at end of pattern |
| 1188 |
7 invalid escape sequence in character class |
7 invalid escape sequence in character class |
| 1189 |
8 range out of order in character class |
8 range out of order in character class |
| 1190 |
9 nothing to repeat |
9 nothing to repeat |
| 1191 |
10 operand of unlimited repeat could match the empty string |
10 [this code is not in use] |
| 1192 |
11 internal error: unexpected repeat |
11 internal error: unexpected repeat |
| 1193 |
12 unrecognized character after (? |
12 unrecognized character after (? |
| 1194 |
13 POSIX named classes are supported only within a class |
13 POSIX named classes are supported only within a class |
| 1197 |
16 erroffset passed as NULL |
16 erroffset passed as NULL |
| 1198 |
17 unknown option bit(s) set |
17 unknown option bit(s) set |
| 1199 |
18 missing ) after comment |
18 missing ) after comment |
| 1200 |
19 parentheses nested too deeply |
19 [this code is not in use] |
| 1201 |
20 regular expression too large |
20 regular expression too large |
| 1202 |
21 failed to get memory |
21 failed to get memory |
| 1203 |
22 unmatched parentheses |
22 unmatched parentheses |
| 1211 |
30 unknown POSIX class name |
30 unknown POSIX class name |
| 1212 |
31 POSIX collating elements are not supported |
31 POSIX collating elements are not supported |
| 1213 |
32 this version of PCRE is not compiled with PCRE_UTF8 support |
32 this version of PCRE is not compiled with PCRE_UTF8 support |
| 1214 |
33 spare error |
33 [this code is not in use] |
| 1215 |
34 character value in \x{...} sequence is too large |
34 character value in \x{...} sequence is too large |
| 1216 |
35 invalid condition (?(0) |
35 invalid condition (?(0) |
| 1217 |
36 \C not allowed in lookbehind assertion |
36 \C not allowed in lookbehind assertion |
| 1220 |
39 closing ) for (?C expected |
39 closing ) for (?C expected |
| 1221 |
40 recursive call could loop indefinitely |
40 recursive call could loop indefinitely |
| 1222 |
41 unrecognized character after (?P |
41 unrecognized character after (?P |
| 1223 |
42 syntax error after (?P |
42 syntax error in subpattern name (missing terminator) |
| 1224 |
43 two named subpatterns have the same name |
43 two named subpatterns have the same name |
| 1225 |
44 invalid UTF-8 string |
44 invalid UTF-8 string |
| 1226 |
45 support for \P, \p, and \X has not been compiled |
45 support for \P, \p, and \X has not been compiled |
| 1230 |
49 too many named subpatterns (maximum 10,000) |
49 too many named subpatterns (maximum 10,000) |
| 1231 |
50 repeated subpattern is too long |
50 repeated subpattern is too long |
| 1232 |
51 octal value is greater than \377 (not in UTF-8 mode) |
51 octal value is greater than \377 (not in UTF-8 mode) |
| 1233 |
|
52 internal error: overran compiling workspace |
| 1234 |
|
53 internal error: previously-checked referenced subpattern not |
| 1235 |
|
found |
| 1236 |
|
54 DEFINE group contains more than one branch |
| 1237 |
|
55 repeating a DEFINE group is not allowed |
| 1238 |
|
56 inconsistent NEWLINE options |
| 1239 |
|
|
| 1240 |
|
|
| 1241 |
STUDYING A PATTERN |
STUDYING A PATTERN |
| 1394 |
is still recognized for backwards compatibility.) |
is still recognized for backwards compatibility.) |
| 1395 |
|
|
| 1396 |
If there is a fixed first byte, for example, from a pattern such as |
If there is a fixed first byte, for example, from a pattern such as |
| 1397 |
(cat|cow|coyote). Otherwise, if either |
(cat|cow|coyote), its value is returned. Otherwise, if either |
| 1398 |
|
|
| 1399 |
(a) the pattern was compiled with the PCRE_MULTILINE option, and every |
(a) the pattern was compiled with the PCRE_MULTILINE option, and every |
| 1400 |
branch starts with "^", or |
branch starts with "^", or |
| 1451 |
PCRE_EXTENDED is set, so white space - including newlines - is |
PCRE_EXTENDED is set, so white space - including newlines - is |
| 1452 |
ignored): |
ignored): |
| 1453 |
|
|
| 1454 |
(?P<date> (?P<year>(\d\d)?\d\d) - |
(?<date> (?<year>(\d\d)?\d\d) - |
| 1455 |
(?P<month>\d\d) - (?P<day>\d\d) ) |
(?<month>\d\d) - (?<day>\d\d) ) |
| 1456 |
|
|
| 1457 |
There are four named subpatterns, so the table has four entries, and |
There are four named subpatterns, so the table has four entries, and |
| 1458 |
each entry in the table is eight bytes long. The table is as follows, |
each entry in the table is eight bytes long. The table is as follows, |
| 1679 |
PCRE_NEWLINE_CR |
PCRE_NEWLINE_CR |
| 1680 |
PCRE_NEWLINE_LF |
PCRE_NEWLINE_LF |
| 1681 |
PCRE_NEWLINE_CRLF |
PCRE_NEWLINE_CRLF |
| 1682 |
|
PCRE_NEWLINE_ANY |
| 1683 |
|
|
| 1684 |
These options override the newline definition that was chosen or |
These options override the newline definition that was chosen or |
| 1685 |
defaulted when the pattern was compiled. For details, see the descrip- |
defaulted when the pattern was compiled. For details, see the descrip- |
| 1686 |
tion pcre_compile() above. During matching, the newline choice affects |
tion of pcre_compile() above. During matching, the newline choice |
| 1687 |
the behaviour of the dot, circumflex, and dollar metacharacters. |
affects the behaviour of the dot, circumflex, and dollar metacharac- |
| 1688 |
|
ters. It may also alter the way the match position is advanced after a |
| 1689 |
|
match failure for an unanchored pattern. When PCRE_NEWLINE_CRLF or |
| 1690 |
|
PCRE_NEWLINE_ANY is set, and a match attempt fails when the current |
| 1691 |
|
position is at a CRLF sequence, the match position is advanced by two |
| 1692 |
|
characters instead of one, in other words, to after the CRLF. |
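The rule for advancing the match position can be sketched as a small function. The helper next_start() is hypothetical; it assumes single-byte characters (no UTF-8 handling) and that pos is less than len.

```c
#include <stddef.h>

/* Next starting offset after a failed match attempt at pos.
   crlf_is_newline is nonzero when CRLF is a recognized newline sequence,
   that is, for the CRLF and ANY conventions. */
size_t next_start(const char *subject, size_t len, size_t pos,
                  int crlf_is_newline)
{
    if (crlf_is_newline && pos + 1 < len &&
        subject[pos] == '\r' && subject[pos + 1] == '\n')
        return pos + 2;   /* step over the whole CRLF */
    return pos + 1;
}
```

For the subject "a\r\nb", a failure at offset 1 moves the start to offset 3 when CRLF is recognized, and to offset 2 otherwise.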
| 1693 |
|
|
| 1694 |
PCRE_NOTBOL |
PCRE_NOTBOL |
| 1695 |
|
|
| 1696 |
This option specifies that the first character of the subject string is not |
This option specifies that the first character of the subject string is not |
| 1697 |
the beginning of a line, so the circumflex metacharacter should not |
the beginning of a line, so the circumflex metacharacter should not |
| 1698 |
match before it. Setting this without PCRE_MULTILINE (at compile time) |
match before it. Setting this without PCRE_MULTILINE (at compile time) |
| 1699 |
causes circumflex never to match. This option affects only the behav- |
causes circumflex never to match. This option affects only the behav- |
| 1700 |
iour of the circumflex metacharacter. It does not affect \A. |
iour of the circumflex metacharacter. It does not affect \A. |
| 1701 |
|
|
| 1702 |
PCRE_NOTEOL |
PCRE_NOTEOL |
| 1703 |
|
|
| 1704 |
This option specifies that the end of the subject string is not the end |
This option specifies that the end of the subject string is not the end |
| 1705 |
of a line, so the dollar metacharacter should not match it nor (except |
of a line, so the dollar metacharacter should not match it nor (except |
| 1706 |
in multiline mode) a newline immediately before it. Setting this with- |
in multiline mode) a newline immediately before it. Setting this with- |
| 1707 |
out PCRE_MULTILINE (at compile time) causes dollar never to match. This |
out PCRE_MULTILINE (at compile time) causes dollar never to match. This |
| 1708 |
option affects only the behaviour of the dollar metacharacter. It does |
option affects only the behaviour of the dollar metacharacter. It does |
| 1709 |
not affect \Z or \z. |
not affect \Z or \z. |
| 1710 |
|
|
| 1711 |
PCRE_NOTEMPTY |
PCRE_NOTEMPTY |
| 1712 |
|
|
| 1713 |
An empty string is not considered to be a valid match if this option is |
An empty string is not considered to be a valid match if this option is |
| 1714 |
set. If there are alternatives in the pattern, they are tried. If all |
set. If there are alternatives in the pattern, they are tried. If all |
| 1715 |
the alternatives match the empty string, the entire match fails. For |
the alternatives match the empty string, the entire match fails. For |
| 1716 |
example, if the pattern |
example, if the pattern |
| 1717 |
|
|
| 1718 |
a?b? |
a?b? |
| 1719 |
|
|
| 1720 |
is applied to a string not beginning with "a" or "b", it matches the |
is applied to a string not beginning with "a" or "b", it matches the |
| 1721 |
empty string at the start of the subject. With PCRE_NOTEMPTY set, this |
empty string at the start of the subject. With PCRE_NOTEMPTY set, this |
| 1722 |
match is not valid, so PCRE searches further into the string for occur- |
match is not valid, so PCRE searches further into the string for occur- |
| 1723 |
rences of "a" or "b". |
rences of "a" or "b". |
| 1724 |
|
|
| 1725 |
Perl has no direct equivalent of PCRE_NOTEMPTY, but it does make a spe- |
Perl has no direct equivalent of PCRE_NOTEMPTY, but it does make a spe- |
| 1726 |
cial case of a pattern match of the empty string within its split() |
cial case of a pattern match of the empty string within its split() |
| 1727 |
function, and when using the /g modifier. It is possible to emulate |
function, and when using the /g modifier. It is possible to emulate |
| 1728 |
Perl's behaviour after matching a null string by first trying the match |
Perl's behaviour after matching a null string by first trying the match |
| 1729 |
again at the same offset with PCRE_NOTEMPTY and PCRE_ANCHORED, and then |
again at the same offset with PCRE_NOTEMPTY and PCRE_ANCHORED, and then |
| 1730 |
if that fails by advancing the starting offset (see below) and trying |
if that fails by advancing the starting offset (see below) and trying |
| 1731 |
an ordinary match again. There is some code that demonstrates how to do |
an ordinary match again. There is some code that demonstrates how to do |
| 1732 |
this in the pcredemo.c sample program. |
this in the pcredemo.c sample program. |
| 1733 |
|
|
| 1734 |
PCRE_NO_UTF8_CHECK |
PCRE_NO_UTF8_CHECK |
| 1735 |
|
|
| 1736 |
When PCRE_UTF8 is set at compile time, the validity of the subject as a |
When PCRE_UTF8 is set at compile time, the validity of the subject as a |
| 1737 |
UTF-8 string is automatically checked when pcre_exec() is subsequently |
UTF-8 string is automatically checked when pcre_exec() is subsequently |
| 1738 |
called. The value of startoffset is also checked to ensure that it |
called. The value of startoffset is also checked to ensure that it |
| 1739 |
points to the start of a UTF-8 character. If an invalid UTF-8 sequence |
points to the start of a UTF-8 character. If an invalid UTF-8 sequence |
| 1740 |
of bytes is found, pcre_exec() returns the error PCRE_ERROR_BADUTF8. If |
of bytes is found, pcre_exec() returns the error PCRE_ERROR_BADUTF8. If |
| 1741 |
startoffset contains an invalid value, PCRE_ERROR_BADUTF8_OFFSET is |
startoffset contains an invalid value, PCRE_ERROR_BADUTF8_OFFSET is |
| 1742 |
returned. |
returned. |
| 1743 |
|
|
| 1744 |
If you already know that your subject is valid, and you want to skip |
If you already know that your subject is valid, and you want to skip |
| 1745 |
these checks for performance reasons, you can set the |
these checks for performance reasons, you can set the |
| 1746 |
PCRE_NO_UTF8_CHECK option when calling pcre_exec(). You might want to |
PCRE_NO_UTF8_CHECK option when calling pcre_exec(). You might want to |
| 1747 |
do this for the second and subsequent calls to pcre_exec() if you are |
do this for the second and subsequent calls to pcre_exec() if you are |
| 1748 |
making repeated calls to find all the matches in a single subject |
making repeated calls to find all the matches in a single subject |
| 1749 |
string. However, you should be sure that the value of startoffset |
string. However, you should be sure that the value of startoffset |
| 1750 |
points to the start of a UTF-8 character. When PCRE_NO_UTF8_CHECK is |
points to the start of a UTF-8 character. When PCRE_NO_UTF8_CHECK is |
| 1751 |
set, the effect of passing an invalid UTF-8 string as a subject, or a |
set, the effect of passing an invalid UTF-8 string as a subject, or a |
| 1752 |
value of startoffset that does not point to the start of a UTF-8 char- |
value of startoffset that does not point to the start of a UTF-8 char- |
| 1753 |
acter, is undefined. Your program may crash. |
acter, is undefined. Your program may crash. |
| 1754 |
|
|
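As an illustration of the startoffset requirement described above, a caller who sets PCRE_NO_UTF8_CHECK could validate the offset itself. The sketch below is a hypothetical helper, not part of the PCRE API, and it assumes the subject is already known to be valid UTF-8. It relies only on the fact that UTF-8 continuation bytes have the bit pattern 10xxxxxx.

```c
/* Hypothetical helper, not part of the PCRE API.
   Returns 1 if `offset` lies within [0, length] and does not point into
   the middle of a multi-byte UTF-8 character, otherwise 0.
   Assumes `subject` is already valid UTF-8. */
static int is_utf8_char_start(const char *subject, int length, int offset)
{
    if (offset < 0 || offset > length)
        return 0;
    if (offset == length)
        return 1;               /* the end of the subject is acceptable */
    /* Continuation bytes look like 10xxxxxx; anything else starts a character. */
    return ((unsigned char)subject[offset] & 0xC0) != 0x80;
}
```

A check of this kind is cheap compared with validating the whole subject again, which is the cost that PCRE_NO_UTF8_CHECK is meant to avoid on repeated calls.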
| 1755 |
PCRE_PARTIAL |
PCRE_PARTIAL |
| 1756 |
|
|
| 1757 |
This option turns on the partial matching feature. If the subject |
This option turns on the partial matching feature. If the subject |
| 1758 |
string fails to match the pattern, but at some point during the match- |
string fails to match the pattern, but at some point during the match- |
| 1759 |
ing process the end of the subject was reached (that is, the subject |
ing process the end of the subject was reached (that is, the subject |
| 1760 |
partially matches the pattern and the failure to match occurred only |
partially matches the pattern and the failure to match occurred only |
| 1761 |
because there were not enough subject characters), pcre_exec() returns |
because there were not enough subject characters), pcre_exec() returns |
| 1762 |
PCRE_ERROR_PARTIAL instead of PCRE_ERROR_NOMATCH. When PCRE_PARTIAL is |
PCRE_ERROR_PARTIAL instead of PCRE_ERROR_NOMATCH. When PCRE_PARTIAL is |
| 1763 |
used, there are restrictions on what may appear in the pattern. These |
used, there are restrictions on what may appear in the pattern. These |
| 1764 |
are discussed in the pcrepartial documentation. |
are discussed in the pcrepartial documentation. |
| 1765 |
|
|
| 1766 |
The string to be matched by pcre_exec() |
The string to be matched by pcre_exec() |
| 1767 |
|
|
| 1768 |
The subject string is passed to pcre_exec() as a pointer in subject, a |
The subject string is passed to pcre_exec() as a pointer in subject, a |
| 1769 |
length in length, and a starting byte offset in startoffset. In UTF-8 |
length in length, and a starting byte offset in startoffset. In UTF-8 |
| 1770 |
mode, the byte offset must point to the start of a UTF-8 character. |
mode, the byte offset must point to the start of a UTF-8 character. |
| 1771 |
Unlike the pattern string, the subject may contain binary zero bytes. |
Unlike the pattern string, the subject may contain binary zero bytes. |
| 1772 |
When the starting offset is zero, the search for a match starts at the |
When the starting offset is zero, the search for a match starts at the |
| 1773 |
beginning of the subject, and this is by far the most common case. |
beginning of the subject, and this is by far the most common case. |
| 1774 |
|
|
| 1775 |
A non-zero starting offset is useful when searching for another match |
A non-zero starting offset is useful when searching for another match |
| 1776 |
in the same subject by calling pcre_exec() again after a previous suc- |
in the same subject by calling pcre_exec() again after a previous suc- |
| 1777 |
cess. Setting startoffset differs from just passing over a shortened |
cess. Setting startoffset differs from just passing over a shortened |
| 1778 |
string and setting PCRE_NOTBOL in the case of a pattern that begins |
string and setting PCRE_NOTBOL in the case of a pattern that begins |
| 1779 |
with any kind of lookbehind. For example, consider the pattern |
with any kind of lookbehind. For example, consider the pattern |
| 1780 |
|
|
| 1781 |
\Biss\B |
\Biss\B |
| 1782 |
|
|
| 1783 |
which finds occurrences of "iss" in the middle of words. (\B matches |
which finds occurrences of "iss" in the middle of words. (\B matches |
| 1784 |
only if the current position in the subject is not a word boundary.) |
only if the current position in the subject is not a word boundary.) |
| 1785 |
When applied to the string "Mississippi" the first call to pcre_exec() |
When applied to the string "Mississippi" the first call to pcre_exec() |
| 1786 |
finds the first occurrence. If pcre_exec() is called again with just |
finds the first occurrence. If pcre_exec() is called again with just |
| 1787 |
the remainder of the subject, namely "issippi", it does not match, |
the remainder of the subject, namely "issippi", it does not match, |
| 1788 |
because \B is always false at the start of the subject, which is deemed |
because \B is always false at the start of the subject, which is deemed |
| 1789 |
to be a word boundary. However, if pcre_exec() is passed the entire |
to be a word boundary. However, if pcre_exec() is passed the entire |
| 1790 |
string again, but with startoffset set to 4, it finds the second occur- |
string again, but with startoffset set to 4, it finds the second occur- |
| 1791 |
rence of "iss" because it is able to look behind the starting point to |
rence of "iss" because it is able to look behind the starting point to |
| 1792 |
discover that it is preceded by a letter. |
discover that it is preceded by a letter. |
| 1793 |
|
|
| 1794 |
If a non-zero starting offset is passed when the pattern is anchored, |
If a non-zero starting offset is passed when the pattern is anchored, |
| 1795 |
one attempt to match at the given offset is made. This can only succeed |
one attempt to match at the given offset is made. This can only succeed |
| 1796 |
if the pattern does not require the match to be at the start of the |
if the pattern does not require the match to be at the start of the |
| 1797 |
subject. |
subject. |
| 1798 |
|
|
| 1799 |
How pcre_exec() returns captured substrings |
How pcre_exec() returns captured substrings |
| 1800 |
|
|
| 1801 |
In general, a pattern matches a certain portion of the subject, and in |
In general, a pattern matches a certain portion of the subject, and in |
| 1802 |
addition, further substrings from the subject may be picked out by |
addition, further substrings from the subject may be picked out by |
| 1803 |
parts of the pattern. Following the usage in Jeffrey Friedl's book, |
parts of the pattern. Following the usage in Jeffrey Friedl's book, |
| 1804 |
this is called "capturing" in what follows, and the phrase "capturing |
this is called "capturing" in what follows, and the phrase "capturing |
| 1805 |
subpattern" is used for a fragment of a pattern that picks out a sub- |
subpattern" is used for a fragment of a pattern that picks out a sub- |
| 1806 |
string. PCRE supports several other kinds of parenthesized subpattern |
string. PCRE supports several other kinds of parenthesized subpattern |
| 1807 |
that do not cause substrings to be captured. |
that do not cause substrings to be captured. |
| 1808 |
|
|
| 1809 |
Captured substrings are returned to the caller via a vector of integer |
Captured substrings are returned to the caller via a vector of integer |
| 1810 |
offsets whose address is passed in ovector. The number of elements in |
offsets whose address is passed in ovector. The number of elements in |
| 1811 |
the vector is passed in ovecsize, which must be a non-negative number. |
the vector is passed in ovecsize, which must be a non-negative number. |
| 1812 |
Note: this argument is NOT the size of ovector in bytes. |
Note: this argument is NOT the size of ovector in bytes. |
| 1813 |
|
|
| 1814 |
The first two-thirds of the vector is used to pass back captured sub- |
The first two-thirds of the vector is used to pass back captured sub- |
| 1815 |
strings, each substring using a pair of integers. The remaining third |
strings, each substring using a pair of integers. The remaining third |
| 1816 |
of the vector is used as workspace by pcre_exec() while matching cap- |
of the vector is used as workspace by pcre_exec() while matching cap- |
| 1817 |
turing subpatterns, and is not available for passing back information. |
turing subpatterns, and is not available for passing back information. |
| 1818 |
The length passed in ovecsize should always be a multiple of three. If |
The length passed in ovecsize should always be a multiple of three. If |
| 1819 |
it is not, it is rounded down. |
it is not, it is rounded down. |
| 1820 |
|
|
| 1821 |
When a match is successful, information about captured substrings is |
When a match is successful, information about captured substrings is |
| 1822 |
returned in pairs of integers, starting at the beginning of ovector, |
returned in pairs of integers, starting at the beginning of ovector, |
| 1823 |
and continuing up to two-thirds of its length at the most. The first |
and continuing up to two-thirds of its length at the most. The first |
| 1824 |
element of a pair is set to the offset of the first character in a sub- |
element of a pair is set to the offset of the first character in a sub- |
| 1825 |
string, and the second is set to the offset of the first character |
string, and the second is set to the offset of the first character |
| 1826 |
after the end of a substring. The first pair, ovector[0] and ovec- |
after the end of a substring. The first pair, ovector[0] and ovec- |
| 1827 |
tor[1], identify the portion of the subject string matched by the |
tor[1], identify the portion of the subject string matched by the |
| 1828 |
entire pattern. The next pair is used for the first capturing subpat- |
entire pattern. The next pair is used for the first capturing subpat- |
| 1829 |
tern, and so on. The value returned by pcre_exec() is one more than the |
tern, and so on. The value returned by pcre_exec() is one more than the |
| 1830 |
highest numbered pair that has been set. For example, if two substrings |
highest numbered pair that has been set. For example, if two substrings |
| 1831 |
have been captured, the returned value is 3. If there are no capturing |
have been captured, the returned value is 3. If there are no capturing |
| 1832 |
subpatterns, the return value from a successful match is 1, indicating |
subpatterns, the return value from a successful match is 1, indicating |
| 1833 |
that just the first pair of offsets has been set. |
that just the first pair of offsets has been set. |
| 1834 |
|
|
| 1835 |
If a capturing subpattern is matched repeatedly, it is the last portion |
If a capturing subpattern is matched repeatedly, it is the last portion |
| 1836 |
of the string that it matched that is returned. |
of the string that it matched that is returned. |
| 1837 |
|
|
| 1838 |
If the vector is too small to hold all the captured substring offsets, |
If the vector is too small to hold all the captured substring offsets, |
| 1839 |
it is used as far as possible (up to two-thirds of its length), and the |
it is used as far as possible (up to two-thirds of its length), and the |
| 1840 |
function returns a value of zero. In particular, if the substring off- |
function returns a value of zero. In particular, if the substring off- |
| 1841 |
sets are not of interest, pcre_exec() may be called with ovector passed |
sets are not of interest, pcre_exec() may be called with ovector passed |
| 1842 |
as NULL and ovecsize as zero. However, if the pattern contains back |
as NULL and ovecsize as zero. However, if the pattern contains back |
| 1843 |
references and the ovector is not big enough to remember the related |
references and the ovector is not big enough to remember the related |
| 1844 |
substrings, PCRE has to get additional memory for use during matching. |
substrings, PCRE has to get additional memory for use during matching. |
| 1845 |
Thus it is usually advisable to supply an ovector. |
Thus it is usually advisable to supply an ovector. |
| 1846 |
|
|
| 1847 |
The pcre_info() function can be used to find out how many capturing |
The pcre_info() function can be used to find out how many capturing |
| 1848 |
subpatterns there are in a compiled pattern. The smallest size for |
subpatterns there are in a compiled pattern. The smallest size for |
| 1849 |
ovector that will allow for n captured substrings, in addition to the |
ovector that will allow for n captured substrings, in addition to the |
| 1850 |
offsets of the substring matched by the whole pattern, is (n+1)*3. |
offsets of the substring matched by the whole pattern, is (n+1)*3. |
| 1851 |
|
|
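The sizing rule above can be sketched as two small calculations. The function names are hypothetical and are not part of the PCRE API; they only restate the arithmetic in the text: (n+1)*3 elements for n capturing subpatterns, and one offset pair for every three elements once ovecsize has been rounded down.

```c
/* Hypothetical helpers, not part of the PCRE API. */

/* Smallest ovector length, in ints, that holds the whole-match pair plus
   n capture pairs, including the third reserved as workspace. */
static int ovector_size_for(int n)
{
    return (n + 1) * 3;
}

/* Number of offset pairs that can be passed back for a given ovecsize.
   The size is rounded down to a multiple of three, and two-thirds of
   that is used for pairs, which works out to ovecsize / 3 pairs. */
static int usable_pairs(int ovecsize)
{
    return ovecsize / 3;
}
```

For example, a pattern with two capturing subpatterns needs at least nine elements, and an ovector of ten elements offers no more pairs than one of nine.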
| 1852 |
It is possible for capturing subpattern number n+1 to match some part |
It is possible for capturing subpattern number n+1 to match some part |
| 1853 |
of the subject when subpattern n has not been used at all. For example, |
of the subject when subpattern n has not been used at all. For example, |
| 1854 |
if the string "abc" is matched against the pattern (a|(z))(bc) the |
if the string "abc" is matched against the pattern (a|(z))(bc) the |
| 1855 |
return from the function is 4, and subpatterns 1 and 3 are matched, but |
return from the function is 4, and subpatterns 1 and 3 are matched, but |
| 1856 |
2 is not. When this happens, both values in the offset pairs corre- |
2 is not. When this happens, both values in the offset pairs corre- |
| 1857 |
sponding to unused subpatterns are set to -1. |
sponding to unused subpatterns are set to -1. |
| 1858 |
|
|
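To make the offset-pair layout concrete, the following sketch copies one captured substring out of the subject using the pairs in ovector, treating a pair of -1 values as an unset subpattern. The helper name is hypothetical. In real code the convenience functions described later in this document do the same job, so this is only meant to show how the raw offsets are read.

```c
#include <string.h>

/* Hypothetical helper, not part of the PCRE API.
   Copies capture number `n` from `subject` into `buf` using the offset
   pairs in `ovector`. Returns the length copied, or -1 if the subpattern
   is unset (offsets of -1) or the text does not fit in `buf`. */
static int copy_capture(const char *subject, const int *ovector, int n,
                        char *buf, int bufsize)
{
    int start = ovector[2 * n];      /* offset of the first character */
    int end = ovector[2 * n + 1];    /* offset just past the last one */
    int len;

    if (start < 0 || end < 0)
        return -1;                   /* subpattern took no part in the match */
    len = end - start;
    if (len + 1 > bufsize)
        return -1;                   /* no room for text plus terminator */
    memcpy(buf, subject + start, (size_t)len);
    buf[len] = '\0';
    return len;
}
```

For the example above, "abc" matched against (a|(z))(bc) gives the pairs 0,3 then 0,1 then -1,-1 then 1,3, so the helper yields "a" for subpattern 1, reports subpattern 2 as unset, and yields "bc" for subpattern 3.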
| 1859 |
Offset values that correspond to unused subpatterns at the end of the |
Offset values that correspond to unused subpatterns at the end of the |
| 1860 |
expression are also set to -1. For example, if the string "abc" is |
expression are also set to -1. For example, if the string "abc" is |
| 1861 |
matched against the pattern (abc)(x(yz)?)? subpatterns 2 and 3 are not |
matched against the pattern (abc)(x(yz)?)? subpatterns 2 and 3 are not |
| 1862 |
matched. The return from the function is 2, because the highest used |
matched. The return from the function is 2, because the highest used |
| 1863 |
capturing subpattern number is 1. However, you can refer to the offsets |
capturing subpattern number is 1. However, you can refer to the offsets |
| 1864 |
for the second and third capturing subpatterns if you wish (assuming |
for the second and third capturing subpatterns if you wish (assuming |
| 1865 |
the vector is large enough, of course). |
the vector is large enough, of course). |
| 1866 |
|
|
| 1867 |
Some convenience functions are provided for extracting the captured |
Some convenience functions are provided for extracting the captured |
| 1868 |
substrings as separate strings. These are described below. |
substrings as separate strings. These are described below. |
| 1869 |
|
|
| 1870 |
Error return values from pcre_exec() |
Error return values from pcre_exec() |
| 1871 |
|
|
| 1872 |
If pcre_exec() fails, it returns a negative number. The following are |
If pcre_exec() fails, it returns a negative number. The following are |
| 1873 |
defined in the header file: |
defined in the header file: |
| 1874 |
|
|
| 1875 |
PCRE_ERROR_NOMATCH (-1) |
PCRE_ERROR_NOMATCH (-1) |
| 1878 |
|
|
| 1879 |
PCRE_ERROR_NULL (-2) |
PCRE_ERROR_NULL (-2) |
| 1880 |
|
|
| 1881 |
Either code or subject was passed as NULL, or ovector was NULL and |
Either code or subject was passed as NULL, or ovector was NULL and |
| 1882 |
ovecsize was not zero. |
ovecsize was not zero. |
| 1883 |
|
|
| 1884 |
PCRE_ERROR_BADOPTION (-3) |
PCRE_ERROR_BADOPTION (-3) |
| 1887 |
|
|
| 1888 |
PCRE_ERROR_BADMAGIC (-4) |
PCRE_ERROR_BADMAGIC (-4) |
| 1889 |
|
|
| 1890 |
PCRE stores a 4-byte "magic number" at the start of the compiled code, |
PCRE stores a 4-byte "magic number" at the start of the compiled code, |
| 1891 |
to catch the case when it is passed a junk pointer and to detect when a |
to catch the case when it is passed a junk pointer and to detect when a |
| 1892 |
pattern that was compiled in an environment of one endianness is run in |
pattern that was compiled in an environment of one endianness is run in |
| 1893 |
an environment with the other endianness. This is the error that PCRE |
an environment with the other endianness. This is the error that PCRE |
| 1894 |
gives when the magic number is not present. |
gives when the magic number is not present. |
| 1895 |
|
|
| 1896 |
PCRE_ERROR_UNKNOWN_NODE (-5) |
PCRE_ERROR_UNKNOWN_OPCODE (-5) |
| 1897 |
|
|
| 1898 |
While running the pattern match, an unknown item was encountered in the |
While running the pattern match, an unknown item was encountered in the |
| 1899 |
compiled pattern. This error could be caused by a bug in PCRE or by |
compiled pattern. This error could be caused by a bug in PCRE or by |
| 1900 |
overwriting of the compiled pattern. |
overwriting of the compiled pattern. |
| 1901 |
|
|
| 1902 |
PCRE_ERROR_NOMEMORY (-6) |
PCRE_ERROR_NOMEMORY (-6) |
| 1903 |
|
|
| 1904 |
If a pattern contains back references, but the ovector that is passed |
If a pattern contains back references, but the ovector that is passed |
| 1905 |
to pcre_exec() is not big enough to remember the referenced substrings, |
to pcre_exec() is not big enough to remember the referenced substrings, |
| 1906 |
PCRE gets a block of memory at the start of matching to use for this |
PCRE gets a block of memory at the start of matching to use for this |
| 1907 |
purpose. If the call via pcre_malloc() fails, this error is given. The |
purpose. If the call via pcre_malloc() fails, this error is given. The |
| 1908 |
memory is automatically freed at the end of matching. |
memory is automatically freed at the end of matching. |
| 1909 |
|
|
| 1910 |
PCRE_ERROR_NOSUBSTRING (-7) |
PCRE_ERROR_NOSUBSTRING (-7) |
| 1911 |
|
|
| 1912 |
This error is used by the pcre_copy_substring(), pcre_get_substring(), |
This error is used by the pcre_copy_substring(), pcre_get_substring(), |
| 1913 |
and pcre_get_substring_list() functions (see below). It is never |
and pcre_get_substring_list() functions (see below). It is never |
| 1914 |
returned by pcre_exec(). |
returned by pcre_exec(). |
| 1915 |
|
|
| 1916 |
PCRE_ERROR_MATCHLIMIT (-8) |
PCRE_ERROR_MATCHLIMIT (-8) |
| 1917 |
|
|
| 1918 |
The backtracking limit, as specified by the match_limit field in a |
The backtracking limit, as specified by the match_limit field in a |
| 1919 |
pcre_extra structure (or defaulted) was reached. See the description |
pcre_extra structure (or defaulted) was reached. See the description |
| 1920 |
above. |
above. |
| 1921 |
|
|
|
PCRE_ERROR_RECURSIONLIMIT (-21) |
|
|
|
|
|
The internal recursion limit, as specified by the match_limit_recursion |
|
|
field in a pcre_extra structure (or defaulted) was reached. See the |
|
|
description above. |
|
|
|
|
| 1922 |
PCRE_ERROR_CALLOUT (-9) |
PCRE_ERROR_CALLOUT (-9) |
| 1923 |
|
|
| 1924 |
This error is never generated by pcre_exec() itself. It is provided for |
This error is never generated by pcre_exec() itself. It is provided for |
| 1925 |
use by callout functions that want to yield a distinctive error code. |
use by callout functions that want to yield a distinctive error code. |
| 1926 |
See the pcrecallout documentation for details. |
See the pcrecallout documentation for details. |
| 1927 |
|
|
| 1928 |
PCRE_ERROR_BADUTF8 (-10) |
PCRE_ERROR_BADUTF8 (-10) |
| 1929 |
|
|
| 1930 |
A string that contains an invalid UTF-8 byte sequence was passed as a |
A string that contains an invalid UTF-8 byte sequence was passed as a |
| 1931 |
subject. |
subject. |
| 1932 |
|
|
| 1933 |
PCRE_ERROR_BADUTF8_OFFSET (-11) |
PCRE_ERROR_BADUTF8_OFFSET (-11) |
| 1934 |
|
|
| 1935 |
The UTF-8 byte sequence that was passed as a subject was valid, but the |
The UTF-8 byte sequence that was passed as a subject was valid, but the |
| 1936 |
value of startoffset did not point to the beginning of a UTF-8 charac- |
value of startoffset did not point to the beginning of a UTF-8 charac- |
| 1937 |
ter. |
ter. |
| 1938 |
|
|
| 1939 |
PCRE_ERROR_PARTIAL (-12) |
PCRE_ERROR_PARTIAL (-12) |
| 1940 |
|
|
| 1941 |
The subject string did not match, but it did match partially. See the |
The subject string did not match, but it did match partially. See the |
| 1942 |
pcrepartial documentation for details of partial matching. |
pcrepartial documentation for details of partial matching. |
| 1943 |
|
|
| 1944 |
PCRE_ERROR_BADPARTIAL (-13) |
PCRE_ERROR_BADPARTIAL (-13) |
| 1945 |
|
|
| 1946 |
The PCRE_PARTIAL option was used with a compiled pattern containing |
The PCRE_PARTIAL option was used with a compiled pattern containing |
| 1947 |
items that are not supported for partial matching. See the pcrepartial |
items that are not supported for partial matching. See the pcrepartial |
| 1948 |
documentation for details of partial matching. |
documentation for details of partial matching. |
| 1949 |
|
|
| 1950 |
PCRE_ERROR_INTERNAL (-14) |
PCRE_ERROR_INTERNAL (-14) |
| 1951 |
|
|
| 1952 |
An unexpected internal error has occurred. This error could be caused |
An unexpected internal error has occurred. This error could be caused |
| 1953 |
by a bug in PCRE or by overwriting of the compiled pattern. |
by a bug in PCRE or by overwriting of the compiled pattern. |
| 1954 |
|
|
| 1955 |
PCRE_ERROR_BADCOUNT (-15) |
PCRE_ERROR_BADCOUNT (-15) |
| 1956 |
|
|
| 1957 |
This error is given if the value of the ovecsize argument is negative. |
This error is given if the value of the ovecsize argument is negative. |
| 1958 |
|
|
| 1959 |
|
PCRE_ERROR_RECURSIONLIMIT (-21) |
| 1960 |
|
|
| 1961 |
|
The internal recursion limit, as specified by the match_limit_recursion |
| 1962 |
|
field in a pcre_extra structure (or defaulted) was reached. See the |
| 1963 |
|
description above. |
| 1964 |
|
|
| 1965 |
|
PCRE_ERROR_NULLWSLIMIT (-22) |
| 1966 |
|
|
| 1967 |
|
When a group that can match an empty substring is repeated with an |
| 1968 |
|
unbounded upper limit, the subject position at the start of the group |
| 1969 |
|
must be remembered, so that a test for an empty string can be made when |
| 1970 |
|
the end of the group is reached. Some workspace is required for this; |
| 1971 |
|
if it runs out, this error is given. |
| 1972 |
|
|
| 1973 |
|
PCRE_ERROR_BADNEWLINE (-23) |
| 1974 |
|
|
| 1975 |
|
An invalid combination of PCRE_NEWLINE_xxx options was given. |
| 1976 |
|
|
| 1977 |
|
Error numbers -16 to -20 are not used by pcre_exec(). |
| 1978 |
|
|
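When logging failures it can help to turn the negative return value into its symbolic name. The sketch below is a hypothetical lookup, not a PCRE function. It lists the values given above as plain integers so that it compiles without pcre.h, and for -5 it uses the name shown in the newer column of this comparison.

```c
#include <stddef.h>

/* Hypothetical lookup table, not part of the PCRE API. The numeric values
   are those listed in the text above. */
struct exec_error {
    int code;
    const char *name;
};

static const struct exec_error exec_errors[] = {
    {  -1, "PCRE_ERROR_NOMATCH" },
    {  -2, "PCRE_ERROR_NULL" },
    {  -3, "PCRE_ERROR_BADOPTION" },
    {  -4, "PCRE_ERROR_BADMAGIC" },
    {  -5, "PCRE_ERROR_UNKNOWN_OPCODE" },
    {  -6, "PCRE_ERROR_NOMEMORY" },
    {  -7, "PCRE_ERROR_NOSUBSTRING" },
    {  -8, "PCRE_ERROR_MATCHLIMIT" },
    {  -9, "PCRE_ERROR_CALLOUT" },
    { -10, "PCRE_ERROR_BADUTF8" },
    { -11, "PCRE_ERROR_BADUTF8_OFFSET" },
    { -12, "PCRE_ERROR_PARTIAL" },
    { -13, "PCRE_ERROR_BADPARTIAL" },
    { -14, "PCRE_ERROR_INTERNAL" },
    { -15, "PCRE_ERROR_BADCOUNT" },
    { -21, "PCRE_ERROR_RECURSIONLIMIT" },
    { -22, "PCRE_ERROR_NULLWSLIMIT" },
    { -23, "PCRE_ERROR_BADNEWLINE" }
};

/* Returns the symbolic name for a pcre_exec() error value, or "unknown"
   for values that are not in the table (for example -16 to -20). */
static const char *exec_error_name(int rc)
{
    size_t i;
    for (i = 0; i < sizeof exec_errors / sizeof exec_errors[0]; i++) {
        if (exec_errors[i].code == rc)
            return exec_errors[i].name;
    }
    return "unknown";
}
```

Code that includes pcre.h would normally compare against the macros themselves; the table form is used here only to keep the sketch self-contained.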
| 1979 |
|
|
| 1980 |
EXTRACTING CAPTURED SUBSTRINGS BY NUMBER |
EXTRACTING CAPTURED SUBSTRINGS BY NUMBER |
| 1990 |
int pcre_get_substring_list(const char *subject, |
int pcre_get_substring_list(const char *subject, |
| 1991 |
int *ovector, int stringcount, const char ***listptr); |
int *ovector, int stringcount, const char ***listptr); |
| 1992 |
|
|
| 1993 |
Captured substrings can be accessed directly by using the offsets |
Captured substrings can be accessed directly by using the offsets |
| 1994 |
returned by pcre_exec() in ovector. For convenience, the functions |
returned by pcre_exec() in ovector. For convenience, the functions |
| 1995 |
pcre_copy_substring(), pcre_get_substring(), and pcre_get_sub- |
pcre_copy_substring(), pcre_get_substring(), and pcre_get_sub- |
| 1996 |
string_list() are provided for extracting captured substrings as new, |
string_list() are provided for extracting captured substrings as new, |
| 1997 |
separate, zero-terminated strings. These functions identify substrings |
separate, zero-terminated strings. These functions identify substrings |
| 1998 |
by number. The next section describes functions for extracting named |
by number. The next section describes functions for extracting named |
| 1999 |
substrings. |
substrings. |
| 2000 |
|
|
| 2001 |
A substring that contains a binary zero is correctly extracted and has |
A substring that contains a binary zero is correctly extracted and has |
| 2002 |
a further zero added on the end, but the result is not, of course, a C |
a further zero added on the end, but the result is not, of course, a C |
| 2003 |
string. However, you can process such a string by referring to the |
string. However, you can process such a string by referring to the |
| 2004 |
length that is returned by pcre_copy_substring() and pcre_get_sub- |
length that is returned by pcre_copy_substring() and pcre_get_sub- |
| 2005 |
string(). Unfortunately, the interface to pcre_get_substring_list() is |
string(). Unfortunately, the interface to pcre_get_substring_list() is |
| 2006 |
not adequate for handling strings containing binary zeros, because the |
not adequate for handling strings containing binary zeros, because the |
| 2007 |
end of the final string is not independently indicated. |
end of the final string is not independently indicated. |
| 2008 |
|
|
| 2009 |
The first three arguments are the same for all three of these func- |
The first three arguments are the same for all three of these func- |
| 2010 |
tions: subject is the subject string that has just been successfully |
tions: subject is the subject string that has just been successfully |
| 2011 |
matched, ovector is a pointer to the vector of integer offsets that was |
matched, ovector is a pointer to the vector of integer offsets that was |
| 2012 |
passed to pcre_exec(), and stringcount is the number of substrings that |
passed to pcre_exec(), and stringcount is the number of substrings that |
| 2013 |
were captured by the match, including the substring that matched the |
were captured by the match, including the substring that matched the |
| 2014 |
entire regular expression. This is the value returned by pcre_exec() if |
entire regular expression. This is the value returned by pcre_exec() if |
| 2015 |
it is greater than zero. If pcre_exec() returned zero, indicating that |
it is greater than zero. If pcre_exec() returned zero, indicating that |
| 2016 |
it ran out of space in ovector, the value passed as stringcount should |
it ran out of space in ovector, the value passed as stringcount should |
| 2017 |
be the number of elements in the vector divided by three. |
be the number of elements in the vector divided by three. |
| 2018 |
|
|
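The rule for choosing stringcount can be written as a short function. This is a sketch with a hypothetical name, not part of the PCRE API. It takes the value returned by pcre_exec() and the ovecsize that was passed to it.

```c
/* Hypothetical helper, not part of the PCRE API.
   rc:       value returned by pcre_exec()
   ovecsize: number of elements in the ovector passed to pcre_exec()
   Returns the value to pass as stringcount, or -1 if rc is an error code
   and there are no substrings to extract. */
static int stringcount_for(int rc, int ovecsize)
{
    if (rc < 0)
        return -1;              /* match failed; nothing was captured */
    if (rc == 0)
        return ovecsize / 3;    /* ovector was too small; use what fits */
    return rc;                  /* normal case: use the return value */
}
```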
| 2019 |
The functions pcre_copy_substring() and pcre_get_substring() extract a |
The functions pcre_copy_substring() and pcre_get_substring() extract a |
| 2020 |
single substring, whose number is given as stringnumber. A value of |
single substring, whose number is given as stringnumber. A value of |
| 2021 |
zero extracts the substring that matched the entire pattern, whereas |
zero extracts the substring that matched the entire pattern, whereas |
| 2022 |
higher values extract the captured substrings. For pcre_copy_sub- |
higher values extract the captured substrings. For pcre_copy_sub- |
| 2023 |
string(), the string is placed in buffer, whose length is given by |
string(), the string is placed in buffer, whose length is given by |
| 2024 |
buffersize, while for pcre_get_substring() a new block of memory is |
buffersize, while for pcre_get_substring() a new block of memory is |
| 2025 |
obtained via pcre_malloc, and its address is returned via stringptr. |
obtained via pcre_malloc, and its address is returned via stringptr. |
| 2026 |
The yield of the function is the length of the string, not including |
The yield of the function is the length of the string, not including |
| 2027 |
the terminating zero, or one of |
the terminating zero, or one of these error codes: |
| 2028 |
|
|
| 2029 |
PCRE_ERROR_NOMEMORY (-6) |
PCRE_ERROR_NOMEMORY (-6) |
| 2030 |
|
|
| 2031 |
The buffer was too small for pcre_copy_substring(), or the attempt to |
The buffer was too small for pcre_copy_substring(), or the attempt to |
| 2032 |
get memory failed for pcre_get_substring(). |
get memory failed for pcre_get_substring(). |
| 2033 |
|
|
| 2034 |
PCRE_ERROR_NOSUBSTRING (-7) |
PCRE_ERROR_NOSUBSTRING (-7) |
| 2035 |
|
|
| 2036 |
There is no substring whose number is stringnumber. |
There is no substring whose number is stringnumber. |
| 2037 |
|
|
| 2038 |
The pcre_get_substring_list() function extracts all available sub- |
The pcre_get_substring_list() function extracts all available sub- |
| 2039 |
strings and builds a list of pointers to them. All this is done in a |
strings and builds a list of pointers to them. All this is done in a |
| 2040 |
single block of memory that is obtained via pcre_malloc. The address of |
single block of memory that is obtained via pcre_malloc. The address of |
| 2041 |
the memory block is returned via listptr, which is also the start of |
the memory block is returned via listptr, which is also the start of |
| 2042 |
the list of string pointers. The end of the list is marked by a NULL |
the list of string pointers. The end of the list is marked by a NULL |
| 2043 |
pointer. The yield of the function is zero if all went well, or |
pointer. The yield of the function is zero if all went well, or the |
| 2044 |
|
error code |
| 2045 |
|
|
| 2046 |
PCRE_ERROR_NOMEMORY (-6) |
PCRE_ERROR_NOMEMORY (-6) |
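Because the list is terminated by a NULL pointer, a caller does not need a separate count in order to walk it. A minimal sketch, in which the function name sketch_count_substrings is hypothetical:

```c
#include <stddef.h>

/* Count the entries in a NULL-terminated list of string pointers,
   laid out like the list whose address is returned via listptr. */
int sketch_count_substrings(const char **list)
{
    int n = 0;

    while (list[n] != NULL)
        n++;
    return n;
}
```

Since the pointers and the strings live in one block of memory, the whole list is released in one step; the library provides pcre_free_substring_list() for this purpose.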
| 2047 |
|
|
| 2083 |
To extract a substring by name, you first have to find the associated |
To extract a substring by name, you first have to find the associated |
| 2084 |
number. For example, for this pattern |
number. For example, for this pattern |
| 2085 |
|
|
| 2086 |
(a+)b(?P<xxx>\d+)... |
(a+)b(?<xxx>\d+)... |
| 2087 |
|
|
| 2088 |
the number of the subpattern called "xxx" is 2. If the name is known to |
the number of the subpattern called "xxx" is 2. If the name is known to |
| 2089 |
be unique (PCRE_DUPNAMES was not set), you can find the number from the |
be unique (PCRE_DUPNAMES was not set), you can find the number from the |
| 2134 |
third and fourth are pointers to variables which are updated by the |
third and fourth are pointers to variables which are updated by the |
| 2135 |
function. After it has run, they point to the first and last entries in |
function. After it has run, they point to the first and last entries in |
| 2136 |
the name-to-number table for the given name. The function itself |
the name-to-number table for the given name. The function itself |
| 2137 |
returns the length of each entry, or PCRE_ERROR_NOSUBSTRING if there |
returns the length of each entry, or PCRE_ERROR_NOSUBSTRING (-7) if |
| 2138 |
are none. The format of the table is described above in the section |
there are none. The format of the table is described above in the sec- |
| 2139 |
entitled Information about a pattern. Given all the relevant entries |
tion entitled Information about a pattern. Given all the relevant |
| 2140 |
for the name, you can extract each of their numbers, and hence the cap- |
entries for the name, you can extract each of their numbers, and hence |
| 2141 |
tured data, if any. |
the captured data, if any. |
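Extracting the numbers from the relevant entries might look like the sketch below. It assumes the name table layout in which each entry begins with two bytes holding the subpattern number, most significant byte first, followed by the zero-terminated name. The function name and the use of unsigned char pointers are choices made for this illustration (the library itself hands back char pointers, which a caller would cast).

```c
/* Collect the subpattern numbers from the table entries lying between
   first and last inclusive. Each entry is entry_size bytes long: two
   bytes of number (most significant first), then the name.
   Returns how many numbers were stored in numbers[]. */
int sketch_entry_numbers(const unsigned char *first,
                         const unsigned char *last,
                         int entry_size, int *numbers, int max_numbers)
{
    const unsigned char *entry;
    int count = 0;

    for (entry = first; entry <= last && count < max_numbers;
         entry += entry_size)
        numbers[count++] = (entry[0] << 8) | entry[1];

    return count;
}
```

Each number obtained in this way can then be used to look at the corresponding pair of ovector elements, or be passed to one of the by-number extraction functions.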
| 2142 |
|
|
| 2143 |
|
|
| 2144 |
FINDING ALL POSSIBLE MATCHES |
FINDING ALL POSSIBLE MATCHES |
| 2167 |
int *workspace, int wscount); |
int *workspace, int wscount); |
| 2168 |
|
|
| 2169 |
The function pcre_dfa_exec() is called to match a subject string |
The function pcre_dfa_exec() is called to match a subject string |
| 2170 |
against a compiled pattern, using a "DFA" matching algorithm. This has |
against a compiled pattern, using a matching algorithm that scans the |
| 2171 |
different characteristics to the normal algorithm, and is not compati- |
subject string just once, and does not backtrack. This has different |
| 2172 |
ble with Perl. Some of the features of PCRE patterns are not supported. |
characteristics to the normal algorithm, and is not compatible with |
| 2173 |
Nevertheless, there are times when this kind of matching can be useful. |
Perl. Some of the features of PCRE patterns are not supported. Never- |
| 2174 |
For a discussion of the two matching algorithms, see the pcrematching |
theless, there are times when this kind of matching can be useful. For |
| 2175 |
documentation. |
a discussion of the two matching algorithms, see the pcrematching docu- |
| 2176 |
|
mentation. |
| 2177 |
|
|
| 2178 |
The arguments for the pcre_dfa_exec() function are the same as for |
The arguments for the pcre_dfa_exec() function are the same as for |
| 2179 |
pcre_exec(), plus two extras. The ovector argument is used in a differ- |
pcre_exec(), plus two extras. The ovector argument is used in a differ- |
| 2180 |
ent way, and this is described below. The other common arguments are |
ent way, and this is described below. The other common arguments are |
| 2181 |
used in the same way as for pcre_exec(), so their description is not |
used in the same way as for pcre_exec(), so their description is not |
| 2182 |
repeated here. |
repeated here. |
| 2183 |
|
|
| 2184 |
The two additional arguments provide workspace for the function. The |
The two additional arguments provide workspace for the function. The |
| 2185 |
workspace vector should contain at least 20 elements. It is used for |
workspace vector should contain at least 20 elements. It is used for |
| 2186 |
keeping track of multiple paths through the pattern tree. More |
keeping track of multiple paths through the pattern tree. More |
| 2187 |
workspace will be needed for patterns and subjects where there are a |
workspace will be needed for patterns and subjects where there are a |
| 2188 |
lot of potential matches. |
lot of potential matches. |
| 2189 |
|
|
| 2190 |
Here is an example of a simple call to pcre_dfa_exec(): |
Here is an example of a simple call to pcre_dfa_exec(): |
| 2206 |
|
|
| 2207 |
Option bits for pcre_dfa_exec() |
Option bits for pcre_dfa_exec() |
| 2208 |
|
|
| 2209 |
The unused bits of the options argument for pcre_dfa_exec() must be |
The unused bits of the options argument for pcre_dfa_exec() must be |
| 2210 |
zero. The only bits that may be set are PCRE_ANCHORED, PCRE_NEW- |
zero. The only bits that may be set are PCRE_ANCHORED, PCRE_NEW- |
| 2211 |
LINE_xxx, PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, PCRE_NO_UTF8_CHECK, |
LINE_xxx, PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, PCRE_NO_UTF8_CHECK, |
| 2212 |
PCRE_PARTIAL, PCRE_DFA_SHORTEST, and PCRE_DFA_RESTART. All but the last |
PCRE_PARTIAL, PCRE_DFA_SHORTEST, and PCRE_DFA_RESTART. All but the last |
| 2213 |
three of these are the same as for pcre_exec(), so their description is |
three of these are the same as for pcre_exec(), so their description is |
| 2214 |
not repeated here. |
not repeated here. |
| 2215 |
|
|
| 2216 |
PCRE_PARTIAL |
PCRE_PARTIAL |
| 2217 |
|
|
| 2218 |
This has the same general effect as it does for pcre_exec(), but the |
This has the same general effect as it does for pcre_exec(), but the |
| 2219 |
details are slightly different. When PCRE_PARTIAL is set for |
details are slightly different. When PCRE_PARTIAL is set for |
| 2220 |
pcre_dfa_exec(), the return code PCRE_ERROR_NOMATCH is converted into |
pcre_dfa_exec(), the return code PCRE_ERROR_NOMATCH is converted into |
| 2221 |
PCRE_ERROR_PARTIAL if the end of the subject is reached, there have |
PCRE_ERROR_PARTIAL if the end of the subject is reached, there have |
| 2222 |
been no complete matches, but there is still at least one matching pos- |
been no complete matches, but there is still at least one matching pos- |
| 2223 |
sibility. The portion of the string that provided the partial match is |
sibility. The portion of the string that provided the partial match is |
| 2224 |
set as the first matching string. |
set as the first matching string. |
| 2225 |
|
|
| 2226 |
PCRE_DFA_SHORTEST |
PCRE_DFA_SHORTEST |
| 2227 |
|
|
| 2228 |
Setting the PCRE_DFA_SHORTEST option causes the matching algorithm to |
Setting the PCRE_DFA_SHORTEST option causes the matching algorithm to |
| 2229 |
stop as soon as it has found one match. Because of the way the DFA |
stop as soon as it has found one match. Because of the way the alterna- |
| 2230 |
algorithm works, this is necessarily the shortest possible match at the |
tive algorithm works, this is necessarily the shortest possible match |
| 2231 |
first possible matching point in the subject string. |
at the first possible matching point in the subject string. |
| 2232 |
|
|
| 2233 |
PCRE_DFA_RESTART |
PCRE_DFA_RESTART |
| 2234 |
|
|
| 2235 |
When pcre_dfa_exec() is called with the PCRE_PARTIAL option, and |
When pcre_dfa_exec() is called with the PCRE_PARTIAL option, and |
| 2236 |
returns a partial match, it is possible to call it again, with addi- |
returns a partial match, it is possible to call it again, with addi- |
| 2237 |
tional subject characters, and have it continue with the same match. |
tional subject characters, and have it continue with the same match. |
| 2238 |
The PCRE_DFA_RESTART option requests this action; when it is set, the |
The PCRE_DFA_RESTART option requests this action; when it is set, the |
| 2239 |
workspace and wscount options must reference the same vector as before |
workspace and wscount options must reference the same vector as before |
| 2240 |
because data about the match so far is left in them after a partial |
because data about the match so far is left in them after a partial |
| 2241 |
match. There is more discussion of this facility in the pcrepartial |
match. There is more discussion of this facility in the pcrepartial |
| 2242 |
documentation. |
documentation. |
| 2243 |
|
|
| 2244 |
Successful returns from pcre_dfa_exec() |
Successful returns from pcre_dfa_exec() |
| 2245 |
|
|
| 2246 |
When pcre_dfa_exec() succeeds, it may have matched more than one sub- |
When pcre_dfa_exec() succeeds, it may have matched more than one sub- |
| 2247 |
string in the subject. Note, however, that all the matches from one run |
string in the subject. Note, however, that all the matches from one run |
| 2248 |
of the function start at the same point in the subject. The shorter |
of the function start at the same point in the subject. The shorter |
| 2249 |
matches are all initial substrings of the longer matches. For example, |
matches are all initial substrings of the longer matches. For example, |
| 2250 |
if the pattern |
if the pattern |
| 2251 |
|
|
| 2252 |
<.*> |
<.*> |
| 2261 |
<something> <something else> |
<something> <something else> |
| 2262 |
<something> <something else> <something further> |
<something> <something else> <something further> |
| 2263 |
|
|
| 2264 |
On success, the yield of the function is a number greater than zero, |
On success, the yield of the function is a number greater than zero, |
| 2265 |
which is the number of matched substrings. The substrings themselves |
which is the number of matched substrings. The substrings themselves |
| 2266 |
are returned in ovector. Each string uses two elements; the first is |
are returned in ovector. Each string uses two elements; the first is |
| 2267 |
the offset to the start, and the second is the offset to the end. All |
the offset to the start, and the second is the offset to the end. In |
| 2268 |
the strings have the same start offset. (Space could have been saved by |
fact, all the strings have the same start offset. (Space could have |
| 2269 |
giving this only once, but it was decided to retain some compatibility |
been saved by giving this only once, but it was decided to retain some |
| 2270 |
with the way pcre_exec() returns data, even though the meaning of the |
compatibility with the way pcre_exec() returns data, even though the |
| 2271 |
strings is different.) |
meaning of the strings is different.) |
| 2272 |
|
|
| 2273 |
The strings are returned in reverse order of length; that is, the long- |
The strings are returned in reverse order of length; that is, the long- |
| 2274 |
est matching string is given first. If there were too many matches to |
est matching string is given first. If there were too many matches to |
| 2275 |
fit into ovector, the yield of the function is zero, and the vector is |
fit into ovector, the yield of the function is zero, and the vector is |
| 2276 |
filled with the longest matches. |
filled with the longest matches. |
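The layout of ovector after a successful call can be illustrated without calling the library at all. The sketch below takes a subject, an ovector and a positive yield, and reports the matches in the order in which they are stored. The function names are hypothetical, and the offsets in the usage note were worked out by hand rather than produced by pcre_dfa_exec().

```c
#include <stdio.h>

/* Length of match i (0 is the longest), or -1 if i is out of range. */
int sketch_dfa_match_length(const int *ovector, int count, int i)
{
    if (i < 0 || i >= count)
        return -1;
    return ovector[2 * i + 1] - ovector[2 * i];
}

/* Print every match, longest first, which is the order in which the
   pairs of offsets are stored in ovector. */
void sketch_print_dfa_matches(const char *subject, const int *ovector,
                              int count)
{
    int i;

    for (i = 0; i < count; i++)
        printf("%2d: %.*s\n", i,
               sketch_dfa_match_length(ovector, count, i),
               subject + ovector[2 * i]);
}
```

For a subject such as "This is <something> <something else> <something further> no more" and the pattern <.*>, the three matches all start at offset 8 and end at offsets 56, 36 and 19, so ovector would begin {8, 56, 8, 36, 8, 19} and the yield would be 3.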
| 2277 |
|
|
| 2278 |
Error returns from pcre_dfa_exec() |
Error returns from pcre_dfa_exec() |
| 2279 |
|
|
| 2280 |
The pcre_dfa_exec() function returns a negative number when it fails. |
The pcre_dfa_exec() function returns a negative number when it fails. |
| 2281 |
Many of the errors are the same as for pcre_exec(), and these are |
Many of the errors are the same as for pcre_exec(), and these are |
| 2282 |
described above. There are in addition the following errors that are |
described above. There are in addition the following errors that are |
| 2283 |
specific to pcre_dfa_exec(): |
specific to pcre_dfa_exec(): |
| 2284 |
|
|
| 2285 |
PCRE_ERROR_DFA_UITEM (-16) |
PCRE_ERROR_DFA_UITEM (-16) |
| 2286 |
|
|
| 2287 |
This return is given if pcre_dfa_exec() encounters an item in the pat- |
This return is given if pcre_dfa_exec() encounters an item in the pat- |
| 2288 |
tern that it does not support, for instance, the use of \C or a back |
tern that it does not support, for instance, the use of \C or a back |
| 2289 |
reference. |
reference. |
| 2290 |
|
|
| 2291 |
PCRE_ERROR_DFA_UCOND (-17) |
PCRE_ERROR_DFA_UCOND (-17) |
| 2292 |
|
|
| 2293 |
This return is given if pcre_dfa_exec() encounters a condition item in |
This return is given if pcre_dfa_exec() encounters a condition item |
| 2294 |
a pattern that uses a back reference for the condition. This is not |
that uses a back reference for the condition, or a test for recursion |
| 2295 |
supported. |
in a specific group. These are not supported. |
| 2296 |
|
|
| 2297 |
PCRE_ERROR_DFA_UMLIMIT (-18) |
PCRE_ERROR_DFA_UMLIMIT (-18) |
| 2298 |
|
|
| 2299 |
This return is given if pcre_dfa_exec() is called with an extra block |
This return is given if pcre_dfa_exec() is called with an extra block |
| 2300 |
that contains a setting of the match_limit field. This is not supported |
that contains a setting of the match_limit field. This is not supported |
| 2301 |
(it is meaningless). |
(it is meaningless). |
| 2302 |
|
|
| 2303 |
PCRE_ERROR_DFA_WSSIZE (-19) |
PCRE_ERROR_DFA_WSSIZE (-19) |
| 2304 |
|
|
| 2305 |
This return is given if pcre_dfa_exec() runs out of space in the |
This return is given if pcre_dfa_exec() runs out of space in the |
| 2306 |
workspace vector. |
workspace vector. |
| 2307 |
|
|
| 2308 |
PCRE_ERROR_DFA_RECURSE (-20) |
PCRE_ERROR_DFA_RECURSE (-20) |
| 2309 |
|
|
| 2310 |
When a recursive subpattern is processed, the matching function calls |
When a recursive subpattern is processed, the matching function calls |
| 2311 |
itself recursively, using private vectors for ovector and workspace. |
itself recursively, using private vectors for ovector and workspace. |
| 2312 |
This error is given if the output vector is not large enough. This |
This error is given if the output vector is not large enough. This |
| 2313 |
should be extremely rare, as a vector of size 1000 is used. |
should be extremely rare, as a vector of size 1000 is used. |
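When logging failures it can be convenient to turn these values back into names. A small sketch that uses only the numbers listed above; the function name is hypothetical, and any other value, including the errors shared with pcre_exec(), is reported as NULL:

```c
#include <stddef.h>

/* Map a pcre_dfa_exec()-specific error value to its name.
   Returns NULL for any value not listed in this section. */
const char *sketch_dfa_error_name(int rc)
{
    switch (rc) {
    case -16: return "PCRE_ERROR_DFA_UITEM";
    case -17: return "PCRE_ERROR_DFA_UCOND";
    case -18: return "PCRE_ERROR_DFA_UMLIMIT";
    case -19: return "PCRE_ERROR_DFA_WSSIZE";
    case -20: return "PCRE_ERROR_DFA_RECURSE";
    default:  return NULL;
    }
}
```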
| 2314 |
|
|
| 2315 |
Last updated: 08 June 2006 |
|
| 2316 |
|
SEE ALSO |
| 2317 |
|
|
| 2318 |
|
pcrebuild(3), pcrecallout(3), pcrecpp(3), pcrematching(3), pcrepar- |
| 2319 |
|
tial(3), pcreposix(3), pcreprecompile(3), pcresample(3), pcrestack(3). |
| 2320 |
|
|
| 2321 |
|
Last updated: 30 November 2006 |
| 2322 |
Copyright (c) 1997-2006 University of Cambridge. |
Copyright (c) 1997-2006 University of Cambridge. |
| 2323 |
------------------------------------------------------------------------------ |
------------------------------------------------------------------------------ |
| 2324 |
|
|
| 2492 |
DIFFERENCES BETWEEN PCRE AND PERL |
DIFFERENCES BETWEEN PCRE AND PERL |
| 2493 |
|
|
| 2494 |
This document describes the differences in the ways that PCRE and Perl |
This document describes the differences in the ways that PCRE and Perl |
| 2495 |
handle regular expressions. The differences described here are with |
handle regular expressions. The differences described here are mainly |
| 2496 |
respect to Perl 5.8. |
with respect to Perl 5.8, though PCRE version 7.0 contains some fea- |
| 2497 |
|
tures that are expected to be in the forthcoming Perl 5.10. |
| 2498 |
|
|
| 2499 |
1. PCRE has only a subset of Perl's UTF-8 and Unicode support. Details |
1. PCRE has only a subset of Perl's UTF-8 and Unicode support. Details |
| 2500 |
of what it does have are given in the section on UTF-8 support in the |
of what it does have are given in the section on UTF-8 support in the |
| 2501 |
main pcre page. |
main pcre page. |
| 2502 |
|
|
| 2503 |
2. PCRE does not allow repeat quantifiers on lookahead assertions. Perl |
2. PCRE does not allow repeat quantifiers on lookahead assertions. Perl |
| 2504 |
permits them, but they do not mean what you might think. For example, |
permits them, but they do not mean what you might think. For example, |
| 2505 |
(?!a){3} does not assert that the next three characters are not "a". It |
(?!a){3} does not assert that the next three characters are not "a". It |
| 2506 |
just asserts that the next character is not "a" three times. |
just asserts that the next character is not "a" three times. |
| 2507 |
|
|
| 2508 |
3. Capturing subpatterns that occur inside negative lookahead asser- |
3. Capturing subpatterns that occur inside negative lookahead asser- |
| 2509 |
tions are counted, but their entries in the offsets vector are never |
tions are counted, but their entries in the offsets vector are never |
| 2510 |
set. Perl sets its numerical variables from any such patterns that are |
set. Perl sets its numerical variables from any such patterns that are |
| 2511 |
matched before the assertion fails to match something (thereby succeed- |
matched before the assertion fails to match something (thereby succeed- |
| 2512 |
ing), but only if the negative lookahead assertion contains just one |
ing), but only if the negative lookahead assertion contains just one |
| 2513 |
branch. |
branch. |
| 2514 |
|
|
| 2515 |
4. Though binary zero characters are supported in the subject string, |
4. Though binary zero characters are supported in the subject string, |
| 2516 |
they are not allowed in a pattern string because it is passed as a nor- |
they are not allowed in a pattern string because it is passed as a nor- |
| 2517 |
mal C string, terminated by zero. The escape sequence \0 can be used in |
mal C string, terminated by zero. The escape sequence \0 can be used in |
| 2518 |
the pattern to represent a binary zero. |
the pattern to represent a binary zero. |
| 2519 |
|
|
| 2520 |
5. The following Perl escape sequences are not supported: \l, \u, \L, |
5. The following Perl escape sequences are not supported: \l, \u, \L, |
| 2521 |
\U, and \N. In fact these are implemented by Perl's general string-han- |
\U, and \N. In fact these are implemented by Perl's general string-han- |
| 2522 |
dling and are not part of its pattern matching engine. If any of these |
dling and are not part of its pattern matching engine. If any of these |
| 2523 |
are encountered by PCRE, an error is generated. |
are encountered by PCRE, an error is generated. |
| 2524 |
|
|
| 2525 |
6. The Perl escape sequences \p, \P, and \X are supported only if PCRE |
6. The Perl escape sequences \p, \P, and \X are supported only if PCRE |
| 2526 |
is built with Unicode character property support. The properties that |
is built with Unicode character property support. The properties that |
| 2527 |
can be tested with \p and \P are limited to the general category prop- |
can be tested with \p and \P are limited to the general category prop- |
| 2528 |
erties such as Lu and Nd, script names such as Greek or Han, and the |
erties such as Lu and Nd, script names such as Greek or Han, and the |
| 2529 |
derived properties Any and L&. |
derived properties Any and L&. |
| 2530 |
|
|
| 2531 |
7. PCRE does support the \Q...\E escape for quoting substrings. Charac- |
7. PCRE does support the \Q...\E escape for quoting substrings. Charac- |
| 2532 |
ters in between are treated as literals. This is slightly different |
ters in between are treated as literals. This is slightly different |
| 2533 |
from Perl in that $ and @ are also handled as literals inside the |
from Perl in that $ and @ are also handled as literals inside the |
| 2534 |
quotes. In Perl, they cause variable interpolation (but of course PCRE |
quotes. In Perl, they cause variable interpolation (but of course PCRE |
| 2535 |
does not have variables). Note the following examples: |
does not have variables). Note the following examples: |
| 2536 |
|
|
| 2537 |
Pattern PCRE matches Perl matches |
Pattern PCRE matches Perl matches |
| 2541 |
\Qabc\$xyz\E abc\$xyz abc\$xyz |
\Qabc\$xyz\E abc\$xyz abc\$xyz |
| 2542 |
\Qabc\E\$\Qxyz\E abc$xyz abc$xyz |
\Qabc\E\$\Qxyz\E abc$xyz abc$xyz |
| 2543 |
|
|
| 2544 |
The \Q...\E sequence is recognized both inside and outside character |
The \Q...\E sequence is recognized both inside and outside character |
| 2545 |
classes. |
classes. |
| 2546 |
|
|
| 2547 |
8. Fairly obviously, PCRE does not support the (?{code}) and (?p{code}) |
8. Fairly obviously, PCRE does not support the (?{code}) and (??{code}) |
| 2548 |
constructions. However, there is support for recursive patterns using |
constructions. However, there is support for recursive patterns. This |
| 2549 |
the non-Perl items (?R), (?number), and (?P>name). Also, the PCRE |
is not available in Perl 5.8, but will be in Perl 5.10. Also, the PCRE |
| 2550 |
"callout" feature allows an external function to be called during pat- |
"callout" feature allows an external function to be called during pat- |
| 2551 |
tern matching. See the pcrecallout documentation for details. |
tern matching. See the pcrecallout documentation for details. |
| 2552 |
|
|
| 2553 |
9. There are some differences that are concerned with the settings of |
9. Subpatterns that are called recursively or as "subroutines" are |
| 2554 |
captured strings when part of a pattern is repeated. For example, |
always treated as atomic groups in PCRE. This is like Python, but |
| 2555 |
matching "aba" against the pattern /^(a(b)?)+$/ in Perl leaves $2 |
unlike Perl. |
| 2556 |
|
|
| 2557 |
|
10. There are some differences that are concerned with the settings of |
| 2558 |
|
captured strings when part of a pattern is repeated. For example, |
| 2559 |
|
matching "aba" against the pattern /^(a(b)?)+$/ in Perl leaves $2 |
| 2560 |
unset, but in PCRE it is set to "b". |
unset, but in PCRE it is set to "b". |
| 2561 |
|
|
| 2562 |
10. PCRE provides some extensions to the Perl regular expression facil- |
11. PCRE provides some extensions to the Perl regular expression facil- |
| 2563 |
ities: |
ities. Perl 5.10 will include new features that are not in earlier |
| 2564 |
|
versions, some of which (such as named parentheses) have been in PCRE |
| 2565 |
|
for some time. This list is with respect to Perl 5.10: |
| 2566 |
|
|
| 2567 |
(a) Although lookbehind assertions must match fixed length strings, |
(a) Although lookbehind assertions must match fixed length strings, |
| 2568 |
each alternative branch of a lookbehind assertion can match a different |
each alternative branch of a lookbehind assertion can match a different |
| 2569 |
length of string. Perl requires them all to have the same length. |
length of string. Perl requires them all to have the same length. |
| 2570 |
|
|
| 2571 |
(b) If PCRE_DOLLAR_ENDONLY is set and PCRE_MULTILINE is not set, the $ |
(b) If PCRE_DOLLAR_ENDONLY is set and PCRE_MULTILINE is not set, the $ |
| 2572 |
meta-character matches only at the very end of the string. |
meta-character matches only at the very end of the string. |
| 2573 |
|
|
| 2574 |
(c) If PCRE_EXTRA is set, a backslash followed by a letter with no spe- |
(c) If PCRE_EXTRA is set, a backslash followed by a letter with no spe- |
| 2575 |
cial meaning is faulted. Otherwise, like Perl, the backslash is |
cial meaning is faulted. Otherwise, like Perl, the backslash is |
| 2576 |
ignored. (Perl can be made to issue a warning.) |
ignored. (Perl can be made to issue a warning.) |
| 2577 |
|
|
| 2578 |
(d) If PCRE_UNGREEDY is set, the greediness of the repetition quanti- |
(d) If PCRE_UNGREEDY is set, the greediness of the repetition quanti- |
| 2579 |
fiers is inverted, that is, by default they are not greedy, but if fol- |
fiers is inverted, that is, by default they are not greedy, but if fol- |
| 2580 |
lowed by a question mark they are. |
lowed by a question mark they are. |
| 2581 |
|
|
| 2582 |
(e) PCRE_ANCHORED can be used at matching time to force a pattern to be |
(e) PCRE_ANCHORED can be used at matching time to force a pattern to be |
| 2583 |
tried only at the first matching position in the subject string. |
tried only at the first matching position in the subject string. |
| 2584 |
|
|
| 2585 |
(f) The PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, and PCRE_NO_AUTO_CAP- |
(f) The PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, and PCRE_NO_AUTO_CAP- |
| 2586 |
TURE options for pcre_exec() have no Perl equivalents. |
TURE options for pcre_exec() have no Perl equivalents. |
| 2587 |
|
|
| 2588 |
(g) The (?R), (?number), and (?P>name) constructs allows for recursive |
(g) The callout facility is PCRE-specific. |
|
pattern matching (Perl can do this using the (?p{code}) construct, |
|
|
which PCRE cannot support.) |
|
|
|
|
|
(h) PCRE supports named capturing substrings, using the Python syntax. |
|
|
|
|
|
(i) PCRE supports the possessive quantifier "++" syntax, taken from |
|
|
Sun's Java package. |
|
|
|
|
|
(j) The (R) condition, for testing recursion, is a PCRE extension. |
|
|
|
|
|
(k) The callout facility is PCRE-specific. |
|
| 2589 |
|
|
| 2590 |
(l) The partial matching facility is PCRE-specific. |
(h) The partial matching facility is PCRE-specific. |
| 2591 |
|
|
| 2592 |
(m) Patterns compiled by PCRE can be saved and re-used at a later time, |
(i) Patterns compiled by PCRE can be saved and re-used at a later time, |
| 2593 |
even on different hosts that have the other endianness. |
even on different hosts that have the other endianness. |
| 2594 |
|
|
| 2595 |
(n) The alternative matching function (pcre_dfa_exec()) matches in a |
(j) The alternative matching function (pcre_dfa_exec()) matches in a |
| 2596 |
different way and is not Perl-compatible. |
different way and is not Perl-compatible. |
| 2597 |
|
|
| 2598 |
Last updated: 06 June 2006 |
Last updated: 28 November 2006 |
| 2599 |
Copyright (c) 1997-2006 University of Cambridge. |
Copyright (c) 1997-2006 University of Cambridge. |
| 2600 |
------------------------------------------------------------------------------ |
------------------------------------------------------------------------------ |
| 2601 |
|
|
| 2632 |
function, and how it differs from the normal function, are discussed in |
function, and how it differs from the normal function, are discussed in |
| 2633 |
the pcrematching page. |
the pcrematching page. |
| 2634 |
|
|
| 2635 |
|
|
| 2636 |
|
CHARACTERS AND METACHARACTERS |
| 2637 |
|
|
| 2638 |
A regular expression is a pattern that is matched against a subject |
A regular expression is a pattern that is matched against a subject |
| 2639 |
string from left to right. Most characters stand for themselves in a |
string from left to right. Most characters stand for themselves in a |
| 2640 |
pattern, and match the corresponding characters in the subject. As a |
pattern, and match the corresponding characters in the subject. As a |
| 2659 |
|
|
| 2660 |
There are two different sets of metacharacters: those that are recog- |
There are two different sets of metacharacters: those that are recog- |
| 2661 |
nized anywhere in the pattern except within square brackets, and those |
nized anywhere in the pattern except within square brackets, and those |
| 2662 |
that are recognized in square brackets. Outside square brackets, the |
that are recognized within square brackets. Outside square brackets, |
| 2663 |
metacharacters are as follows: |
the metacharacters are as follows: |
| 2664 |
|
|
| 2665 |
\ general escape character with several uses |
\ general escape character with several uses |
| 2666 |
^ assert start of string (or line, in multiline mode) |
^ assert start of string (or line, in multiline mode) |
| 2782 |
|
|
| 2783 |
Inside a character class, or if the decimal number is greater than 9 |
Inside a character class, or if the decimal number is greater than 9 |
| 2784 |
and there have not been that many capturing subpatterns, PCRE re-reads |
and there have not been that many capturing subpatterns, PCRE re-reads |
| 2785 |
up to three octal digits following the backslash, ane uses them to gen- |
up to three octal digits following the backslash, and uses them to gen- |
| 2786 |
erate a data character. Any subsequent digits stand for themselves. In |
erate a data character. Any subsequent digits stand for themselves. In |
| 2787 |
non-UTF-8 mode, the value of a character specified in octal must be |
non-UTF-8 mode, the value of a character specified in octal must be |
| 2788 |
less than \400. In UTF-8 mode, values up to \777 are permitted. For |
less than \400. In UTF-8 mode, values up to \777 are permitted. For |
| 2809 |
All the sequences that define a single character value can be used both |
All the sequences that define a single character value can be used both |
| 2810 |
inside and outside character classes. In addition, inside a character |
inside and outside character classes. In addition, inside a character |
| 2811 |
class, the sequence \b is interpreted as the backspace character (hex |
class, the sequence \b is interpreted as the backspace character (hex |
| 2812 |
08), and the sequence \X is interpreted as the character "X". Outside a |
08), and the sequences \R and \X are interpreted as the characters "R" |
| 2813 |
character class, these sequences have different meanings (see below). |
and "X", respectively. Outside a character class, these sequences have |
| 2814 |
|
different meanings (see below). |
| 2815 |
|
|
| 2816 |
|
Absolute and relative back references |
| 2817 |
|
|
| 2818 |
|
The sequence \g followed by a positive or negative number, optionally |
| 2819 |
|
enclosed in braces, is an absolute or relative back reference. Back |
| 2820 |
|
references are discussed later, following the discussion of parenthe- |
| 2821 |
|
sized subpatterns. |
| 2822 |
|
|
| 2823 |
Generic character types |
Generic character types |
| 2824 |
|
|
| 2825 |
The third use of backslash is for specifying generic character types. |
Another use of backslash is for specifying generic character types. The |
| 2826 |
The following are always recognized: |
following are always recognized: |
| 2827 |
|
|
| 2828 |
\d any decimal digit |
\d any decimal digit |
| 2829 |
\D any character that is not a decimal digit |
\D any character that is not a decimal digit |
| 2860 |
code character property support is available. The use of locales with |
code character property support is available. The use of locales with |
| 2861 |
Unicode is discouraged. |
Unicode is discouraged. |
| 2862 |
|
|
| 2863 |
|
Newline sequences |
| 2864 |
|
|
| 2865 |
|
Outside a character class, the escape sequence \R matches any Unicode |
| 2866 |
|
newline sequence. This is an extension to Perl. In non-UTF-8 mode \R is |
| 2867 |
|
equivalent to the following: |
| 2868 |
|
|
| 2869 |
|
(?>\r\n|\n|\x0b|\f|\r|\x85) |
| 2870 |
|
|
| 2871 |
|
This is an example of an "atomic group", details of which are given |
| 2872 |
|
below. This particular group matches either the two-character sequence |
| 2873 |
|
CR followed by LF, or one of the single characters LF (linefeed, |
| 2874 |
|
U+000A), VT (vertical tab, U+000B), FF (formfeed, U+000C), CR (carriage |
| 2875 |
|
return, U+000D), or NEL (next line, U+0085). The two-character sequence |
| 2876 |
|
is treated as a single unit that cannot be split. |
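The effect of the group in non-UTF-8 mode can be mimicked by a hand-written test, shown here only as an illustration. The function name is hypothetical and this is not how PCRE implements \R internally. Note that CR followed by LF is always taken as one unit, which is what the atomic group ensures.

```c
#include <stddef.h>

/* Length in bytes of the newline sequence that starts at s[0], or 0 if
   there is none. remaining is the number of bytes available at s. */
size_t sketch_newline_length(const unsigned char *s, size_t remaining)
{
    if (remaining == 0)
        return 0;

    /* The two-character sequence CR LF is never split. */
    if (s[0] == '\r' && remaining >= 2 && s[1] == '\n')
        return 2;

    switch (s[0]) {
    case '\n':   /* LF,  U+000A */
    case 0x0b:   /* VT,  U+000B */
    case '\f':   /* FF,  U+000C */
    case '\r':   /* CR,  U+000D */
    case 0x85:   /* NEL, U+0085 */
        return 1;
    default:
        return 0;
    }
}
```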
| 2877 |
|
|
| 2878 |
|
In UTF-8 mode, two additional characters whose codepoints are greater |
| 2879 |
|
than 255 are added: LS (line separator, U+2028) and PS (paragraph sepa- |
| 2880 |
|
rator, U+2029). Unicode character property support is not needed for |
| 2881 |
|
these characters to be recognized. |
| 2882 |
|
|
| 2883 |
|
Inside a character class, \R matches the letter "R". |
| 2884 |
|
|
| 2885 |
Unicode character properties |
Unicode character properties |
| 2886 |
|
|
| 2887 |
When PCRE is built with Unicode character property support, three addi- |
When PCRE is built with Unicode character property support, three addi- |
| 2908 |
Those that are not part of an identified script are lumped together as |
Those that are not part of an identified script are lumped together as |
| 2909 |
"Common". The current list of scripts is: |
"Common". The current list of scripts is: |
| 2910 |
|
|
| 2911 |
Arabic, Armenian, Bengali, Bopomofo, Braille, Buginese, Buhid, Cana- |
Arabic, Armenian, Balinese, Bengali, Bopomofo, Braille, Buginese, |
| 2912 |
dian_Aboriginal, Cherokee, Common, Coptic, Cypriot, Cyrillic, Deseret, |
Buhid, Canadian_Aboriginal, Cherokee, Common, Coptic, Cuneiform, |
| 2913 |
Devanagari, Ethiopic, Georgian, Glagolitic, Gothic, Greek, Gujarati, |
Cypriot, Cyrillic, Deseret, Devanagari, Ethiopic, Georgian, Glagolitic, |
| 2914 |
Gurmukhi, Han, Hangul, Hanunoo, Hebrew, Hiragana, Inherited, Kannada, |
Gothic, Greek, Gujarati, Gurmukhi, Han, Hangul, Hanunoo, Hebrew, Hira- |
| 2915 |
Katakana, Kharoshthi, Khmer, Lao, Latin, Limbu, Linear_B, Malayalam, |
gana, Inherited, Kannada, Katakana, Kharoshthi, Khmer, Lao, Latin, |
| 2916 |
Mongolian, Myanmar, New_Tai_Lue, Ogham, Old_Italic, Old_Persian, Oriya, |
Limbu, Linear_B, Malayalam, Mongolian, Myanmar, New_Tai_Lue, Nko, |
| 2917 |
Osmanya, Runic, Shavian, Sinhala, Syloti_Nagri, Syriac, Tagalog, Tag- |
Ogham, Old_Italic, Old_Persian, Oriya, Osmanya, Phags_Pa, Phoenician, |
| 2918 |
banwa, Tai_Le, Tamil, Telugu, Thaana, Thai, Tibetan, Tifinagh, |
Runic, Shavian, Sinhala, Syloti_Nagri, Syriac, Tagalog, Tagbanwa, |
| 2919 |
Ugaritic, Yi. |
Tai_Le, Tamil, Telugu, Thaana, Thai, Tibetan, Tifinagh, Ugaritic, Yi. |
| 2920 |
|
|
| 2921 |
Each character has exactly one general category property, specified by |
Each character has exactly one general category property, specified by |
| 2922 |
a two-letter abbreviation. For compatibility with Perl, negation can be |
a two-letter abbreviation. For compatibility with Perl, negation can be |
| 3009 |
|
|
| 3010 |
Simple assertions |
Simple assertions |
| 3011 |
|
|
| 3012 |
The fourth use of backslash is for certain simple assertions. An asser- |
The final use of backslash is for certain simple assertions. An asser- |
| 3013 |
tion specifies a condition that has to be met at a particular point in |
tion specifies a condition that has to be met at a particular point in |
| 3014 |
a match, without consuming any characters from the subject string. The |
a match, without consuming any characters from the subject string. The |
| 3015 |
use of subpatterns for more complicated assertions is described below. |
use of subpatterns for more complicated assertions is described below. |
| 3017 |
|
|
| 3018 |
\b matches at a word boundary |
\b matches at a word boundary |
| 3019 |
\B matches when not at a word boundary |
\B matches when not at a word boundary |
| 3020 |
\A matches at start of subject |
\A matches at the start of the subject |
| 3021 |
\Z matches at end of subject or before newline at end |
\Z matches at the end of the subject |
| 3022 |
\z matches at end of subject |
also matches before a newline at the end of the subject |
| 3023 |
\G matches at first matching position in subject |
\z matches only at the end of the subject |
| 3024 |
|
\G matches at the first matching position in the subject |
| 3025 |
|
|
| 3026 |
These assertions may not appear in character classes (but note that \b |
These assertions may not appear in character classes (but note that \b |
| 3027 |
has a different meaning, namely the backspace character, inside a char- |
has a different meaning, namely the backspace character, inside a char- |
| 3118 |
Outside a character class, a dot in the pattern matches any one charac- |
Outside a character class, a dot in the pattern matches any one charac- |
| 3119 |
ter in the subject string except (by default) a character that signi- |
ter in the subject string except (by default) a character that signi- |
| 3120 |
fies the end of a line. In UTF-8 mode, the matched character may be |
fies the end of a line. In UTF-8 mode, the matched character may be |
| 3121 |
more than one byte long. When a line ending is defined as a single |
more than one byte long. |
|
character (CR or LF), dot never matches that character; when the two- |
|
|
character sequence CRLF is used, dot does not match CR if it is immedi- |
|
|
ately followed by LF, but otherwise it matches all characters (includ- |
|
|
ing isolated CRs and LFs). |
|
|
|
|
|
The behaviour of dot with regard to newlines can be changed. If the |
|
|
PCRE_DOTALL option is set, a dot matches any one character, without |
|
|
exception. If newline is defined as the two-character sequence CRLF, it |
|
|
takes two dots to match it. |
|
| 3122 |
|
|
| 3123 |
The handling of dot is entirely independent of the handling of circum- |
When a line ending is defined as a single character, dot never matches |
| 3124 |
flex and dollar, the only relationship being that they both involve |
that character; when the two-character sequence CRLF is used, dot does |
| 3125 |
|
not match CR if it is immediately followed by LF, but otherwise it |
| 3126 |
|
matches all characters (including isolated CRs and LFs). When any Uni- |
| 3127 |
|
code line endings are being recognized, dot does not match CR or LF or |
| 3128 |
|
any of the other line ending characters. |
| 3129 |
|
|
| 3130 |
|
The behaviour of dot with regard to newlines can be changed. If the |
| 3131 |
|
PCRE_DOTALL option is set, a dot matches any one character, without |
| 3132 |
|
exception. If the two-character sequence CRLF is present in the subject |
| 3133 |
|
string, it takes two dots to match it. |
| 3134 |
|
|
| 3135 |
|
The handling of dot is entirely independent of the handling of circum- |
| 3136 |
|
flex and dollar, the only relationship being that they both involve |
| 3137 |
newlines. Dot has no special meaning in a character class. |
newlines. Dot has no special meaning in a character class. |
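
The effect of a "dot matches everything" option can be seen in a short experiment. This sketch uses Python's re module as a stand-in, on the assumption that its re.DOTALL flag behaves like PCRE_DOTALL for these cases. Python recognizes only LF as a line ending for dot, so the configurable newline conventions described above are not reproduced here.

```python
import re

subject = "a\nc"
default_match = re.fullmatch(r"a.c", subject)            # dot refuses "\n"
dotall_match = re.fullmatch(r"a.c", subject, re.DOTALL)  # dot accepts "\n"

# CRLF is two characters, so two dots are needed to cover it.
crlf_one_dot = re.fullmatch(r"a.c", "a\r\nc", re.DOTALL)
crlf_two_dots = re.fullmatch(r"a..c", "a\r\nc", re.DOTALL)

# Inside a character class the dot is an ordinary character.
literal_dot = re.fullmatch(r"[.]", "x")

print(default_match, dotall_match, crlf_one_dot, crlf_two_dots, literal_dot)
```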
| 3138 |
|
|
| 3139 |
|
|
| 3140 |
MATCHING A SINGLE BYTE |
MATCHING A SINGLE BYTE |
| 3141 |
|
|
| 3142 |
Outside a character class, the escape sequence \C matches any one byte, |
Outside a character class, the escape sequence \C matches any one byte, |
| 3143 |
both in and out of UTF-8 mode. Unlike a dot, it always matches CR and |
both in and out of UTF-8 mode. Unlike a dot, it always matches any |
| 3144 |
LF. The feature is provided in Perl in order to match individual bytes |
line-ending characters. The feature is provided in Perl in order to |
| 3145 |
in UTF-8 mode. Because it breaks up UTF-8 characters into individual |
match individual bytes in UTF-8 mode. Because it breaks up UTF-8 char- |
| 3146 |
bytes, what remains in the string may be a malformed UTF-8 string. For |
acters into individual bytes, what remains in the string may be a mal- |
| 3147 |
this reason, the \C escape sequence is best avoided. |
formed UTF-8 string. For this reason, the \C escape sequence is best |
| 3148 |
|
avoided. |
| 3149 |
|
|
| 3150 |
PCRE does not allow \C to appear in lookbehind assertions (described |
PCRE does not allow \C to appear in lookbehind assertions (described |
| 3151 |
below), because in UTF-8 mode this would make it impossible to calcu- |
below), because in UTF-8 mode this would make it impossible to calcu- |
| 3192 |
PCRE is compiled with Unicode property support as well as with UTF-8 |
PCRE is compiled with Unicode property support as well as with UTF-8 |
| 3193 |
support. |
support. |
| 3194 |
|
|
| 3195 |
Characters that might indicate line breaks (CR and LF) are never |
Characters that might indicate line breaks are never treated in any |
| 3196 |
treated in any special way when matching character classes, whatever |
special way when matching character classes, whatever line-ending |
| 3197 |
line-ending sequence is in use, and whatever setting of the PCRE_DOTALL |
sequence is in use, and whatever setting of the PCRE_DOTALL and |
| 3198 |
and PCRE_MULTILINE options is used. A class such as [^a] always matches |
PCRE_MULTILINE options is used. A class such as [^a] always matches one |
| 3199 |
one of these characters. |
of these characters. |
| 3200 |
|
|
| 3201 |
The minus (hyphen) character can be used to specify a range of charac- |
The minus (hyphen) character can be used to specify a range of charac- |
| 3202 |
ters in a character class. For example, [d-m] matches any letter |
ters in a character class. For example, [d-m] matches any letter |
| 3328 |
PCRE extracts it into the global options (and it will therefore show up |
PCRE extracts it into the global options (and it will therefore show up |
| 3329 |
in data extracted by the pcre_fullinfo() function). |
in data extracted by the pcre_fullinfo() function). |
| 3330 |
|
|
| 3331 |
An option change within a subpattern affects only that part of the cur- |
An option change within a subpattern (see below for a description of |
| 3332 |
rent pattern that follows it, so |
subpatterns) affects only that part of the current pattern that follows |
| 3333 |
|
it, so |
| 3334 |
|
|
| 3335 |
(a(?i)b)c |
(a(?i)b)c |
| 3336 |
|
|
| 3337 |
matches abc and aBc and no other strings (assuming PCRE_CASELESS is not |
matches abc and aBc and no other strings (assuming PCRE_CASELESS is not |
| 3338 |
used). By this means, options can be made to have different settings |
used). By this means, options can be made to have different settings |
| 3339 |
in different parts of the pattern. Any changes made in one alternative |
in different parts of the pattern. Any changes made in one alternative |
| 3340 |
do carry on into subsequent branches within the same subpattern. For |
do carry on into subsequent branches within the same subpattern. For |
| 3341 |
example, |
example, |
| 3342 |
|
|
| 3343 |
(a(?i)b|c) |
(a(?i)b|c) |
| 3344 |
|
|
| 3345 |
matches "ab", "aB", "c", and "C", even though when matching "C" the |
matches "ab", "aB", "c", and "C", even though when matching "C" the |
| 3346 |
first branch is abandoned before the option setting. This is because |
first branch is abandoned before the option setting. This is because |
| 3347 |
the effects of option settings happen at compile time. There would be |
the effects of option settings happen at compile time. There would be |
| 3348 |
some very weird behaviour otherwise. |
some very weird behaviour otherwise. |
| 3349 |
|
|
| 3350 |
The PCRE-specific options PCRE_DUPNAMES, PCRE_UNGREEDY, and PCRE_EXTRA |
The PCRE-specific options PCRE_DUPNAMES, PCRE_UNGREEDY, and PCRE_EXTRA |
| 3351 |
can be changed in the same way as the Perl-compatible options by using |
can be changed in the same way as the Perl-compatible options by using |
| 3352 |
the characters J, U and X respectively. |
the characters J, U and X respectively. |
| 3353 |
|
|
| 3354 |
|
|
| 3361 |
|
|
| 3362 |
cat(aract|erpillar|) |
cat(aract|erpillar|) |
| 3363 |
|
|
| 3364 |
matches one of the words "cat", "cataract", or "caterpillar". Without |
matches one of the words "cat", "cataract", or "caterpillar". Without |
| 3365 |
the parentheses, it would match "cataract", "erpillar" or the empty |
the parentheses, it would match "cataract", "erpillar" or an empty |
| 3366 |
string. |
string. |
| 3367 |
|
|
| 3368 |
2. It sets up the subpattern as a capturing subpattern. This means |
2. It sets up the subpattern as a capturing subpattern. This means |
| 3369 |
that, when the whole pattern matches, that portion of the subject |
that, when the whole pattern matches, that portion of the subject |
| 3370 |
string that matched the subpattern is passed back to the caller via the |
string that matched the subpattern is passed back to the caller via the |
| 3371 |
ovector argument of pcre_exec(). Opening parentheses are counted from |
ovector argument of pcre_exec(). Opening parentheses are counted from |
| 3372 |
left to right (starting from 1) to obtain numbers for the capturing |
left to right (starting from 1) to obtain numbers for the capturing |
| 3373 |
subpatterns. |
subpatterns. |
| 3374 |
|
|
| 3375 |
For example, if the string "the red king" is matched against the pat- |
For example, if the string "the red king" is matched against the pat- |
| 3376 |
tern |
tern |
| 3377 |
|
|
| 3378 |
the ((red|white) (king|queen)) |
the ((red|white) (king|queen)) |
| 3380 |
the captured substrings are "red king", "red", and "king", and are num- |
the captured substrings are "red king", "red", and "king", and are num- |
| 3381 |
bered 1, 2, and 3, respectively. |
bered 1, 2, and 3, respectively. |
| 3382 |
|
|
| 3383 |
The fact that plain parentheses fulfil two functions is not always |
The fact that plain parentheses fulfil two functions is not always |
| 3384 |
helpful. There are often times when a grouping subpattern is required |
helpful. There are often times when a grouping subpattern is required |
| 3385 |
without a capturing requirement. If an opening parenthesis is followed |
without a capturing requirement. If an opening parenthesis is followed |
| 3386 |
by a question mark and a colon, the subpattern does not do any captur- |
by a question mark and a colon, the subpattern does not do any captur- |
| 3387 |
ing, and is not counted when computing the number of any subsequent |
ing, and is not counted when computing the number of any subsequent |
| 3388 |
capturing subpatterns. For example, if the string "the white queen" is |
capturing subpatterns. For example, if the string "the white queen" is |
| 3389 |
matched against the pattern |
matched against the pattern |
| 3390 |
|
|
| 3391 |
the ((?:red|white) (king|queen)) |
the ((?:red|white) (king|queen)) |
| 3392 |
|
|
| 3393 |
the captured substrings are "white queen" and "queen", and are numbered |
the captured substrings are "white queen" and "queen", and are numbered |
| 3394 |
1 and 2. The maximum number of capturing subpatterns is 65535, and the |
1 and 2. The maximum number of capturing subpatterns is 65535. |
|
maximum depth of nesting of all subpatterns, both capturing and non- |
|
|
capturing, is 200. |
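
The numbering of capturing and non-capturing parentheses can be checked with the two "red king" patterns. The sketch below uses Python's re module, which counts opening parentheses in the same left-to-right way for these patterns; the variable names are illustrative.

```python
import re

# Three capturing groups, numbered by the position of their opening parenthesis.
capturing = re.fullmatch(r"the ((red|white) (king|queen))", "the red king")

# (?:...) groups without capturing, so only two numbered groups remain.
non_capturing = re.fullmatch(
    r"the ((?:red|white) (king|queen))", "the white queen"
)

print(capturing.groups())      # ('red king', 'red', 'king')
print(non_capturing.groups())  # ('white queen', 'queen')
```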
|
| 3395 |
|
|
| 3396 |
As a convenient shorthand, if any option settings are required at the |
As a convenient shorthand, if any option settings are required at the |
| 3397 |
start of a non-capturing subpattern, the option letters may appear |
start of a non-capturing subpattern, the option letters may appear |
| 3398 |
between the "?" and the ":". Thus the two patterns |
between the "?" and the ":". Thus the two patterns |
| 3399 |
|
|
| 3400 |
(?i:saturday|sunday) |
(?i:saturday|sunday) |
| 3401 |
(?:(?i)saturday|sunday) |
(?:(?i)saturday|sunday) |
| 3402 |
|
|
| 3403 |
match exactly the same set of strings. Because alternative branches are |
match exactly the same set of strings. Because alternative branches are |
| 3404 |
tried from left to right, and options are not reset until the end of |
tried from left to right, and options are not reset until the end of |
| 3405 |
the subpattern is reached, an option setting in one branch does affect |
the subpattern is reached, an option setting in one branch does affect |
| 3406 |
subsequent branches, so the above patterns match "SUNDAY" as well as |
subsequent branches, so the above patterns match "SUNDAY" as well as |
| 3407 |
"Saturday". |
"Saturday". |
| 3408 |
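
The shorthand with option letters between the "?" and the ":" can be tried as follows. This is a sketch in Python, which accepts the (?i:...) form; recent Python versions reject an option setting such as (?i) that is not at the very start of the pattern, so the second of the two patterns is not shown. The name weekend is illustrative.

```python
import re

# Caseless matching applies to both alternatives inside the group.
weekend = re.compile(r"(?i:saturday|sunday)")

results = {
    word: bool(weekend.fullmatch(word))
    for word in ["Saturday", "SUNDAY", "monday"]
}
print(results)
```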
|
|
| 3409 |
|
|
| 3410 |
NAMED SUBPATTERNS |
NAMED SUBPATTERNS |
| 3411 |
|
|
| 3412 |
Identifying capturing parentheses by number is simple, but it can be |
Identifying capturing parentheses by number is simple, but it can be |
| 3413 |
very hard to keep track of the numbers in complicated regular expres- |
very hard to keep track of the numbers in complicated regular expres- |
| 3414 |
sions. Furthermore, if an expression is modified, the numbers may |
sions. Furthermore, if an expression is modified, the numbers may |
| 3415 |
change. To help with this difficulty, PCRE supports the naming of sub- |
change. To help with this difficulty, PCRE supports the naming of sub- |
| 3416 |
patterns, something that Perl does not provide. The Python syntax |
patterns. This feature was not added to Perl until release 5.10. Python |
| 3417 |
(?P<name>...) is used. References to capturing parentheses from other |
had the feature earlier, and PCRE introduced it at release 4.0, using |
| 3418 |
parts of the pattern, such as backreferences, recursion, and condi- |
the Python syntax. PCRE now supports both the Perl and the Python syn- |
| 3419 |
tions, can be made by name as well as by number. |
tax. |
| 3420 |
|
|
| 3421 |
Names consist of up to 32 alphanumeric characters and underscores. |
In PCRE, a subpattern can be named in one of three ways: (?<name>...) |
| 3422 |
Named capturing parentheses are still allocated numbers as well as |
or (?'name'...) as in Perl, or (?P<name>...) as in Python. References |
| 3423 |
names. The PCRE API provides function calls for extracting the name-to- |
to capturing parentheses from other parts of the pattern, such as back- |
| 3424 |
number translation table from a compiled pattern. There is also a con- |
references, recursion, and conditions, can be made by name as well as |
| 3425 |
venience function for extracting a captured substring by name. |
by number. |
| 3426 |
|
|
| 3427 |
|
Names consist of up to 32 alphanumeric characters and underscores. |
| 3428 |
|
Named capturing parentheses are still allocated numbers as well as |
| 3429 |
|
names, exactly as if the names were not present. The PCRE API provides |
| 3430 |
|
function calls for extracting the name-to-number translation table from |
| 3431 |
|
a compiled pattern. There is also a convenience function for extracting |
| 3432 |
|
a captured substring by name. |
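
To illustrate that named parentheses keep their numbers, here is a small sketch in Python, which supports the (?P<name>...) syntax. The pattern and the group names "year" and "month" are invented for the example, and Python's groupindex attribute stands in for the name-to-number table; the PCRE C functions themselves are not shown.

```python
import re

date = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})")

# Name-to-number table: each named group still has an ordinary number.
name_table = dict(date.groupindex)

m = date.fullmatch("2007-04")
by_name = m.group("year")   # extract the substring by name
by_number = m.group(1)      # the same substring, by number

print(name_table, by_name, by_number)
```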
| 3433 |
|
|
| 3434 |
By default, a name must be unique within a pattern, but it is possible |
By default, a name must be unique within a pattern, but it is possible |
| 3435 |
to relax this constraint by setting the PCRE_DUPNAMES option at compile |
to relax this constraint by setting the PCRE_DUPNAMES option at compile |
| 3439 |
both cases you want to extract the abbreviation. This pattern (ignoring |
both cases you want to extract the abbreviation. This pattern (ignoring |
| 3440 |
the line breaks) does the job: |
the line breaks) does the job: |
| 3441 |
|
|
| 3442 |
(?P<DN>Mon|Fri|Sun)(?:day)?| |
(?<DN>Mon|Fri|Sun)(?:day)?| |
| 3443 |
(?P<DN>Tue)(?:sday)?| |
(?<DN>Tue)(?:sday)?| |
| 3444 |
(?P<DN>Wed)(?:nesday)?| |
(?<DN>Wed)(?:nesday)?| |
| 3445 |
(?P<DN>Thu)(?:rsday)?| |
(?<DN>Thu)(?:rsday)?| |
| 3446 |
(?P<DN>Sat)(?:urday)? |
(?<DN>Sat)(?:urday)? |
| 3447 |
|
|
| 3448 |
There are five capturing substrings, but only one is ever set after a |
There are five capturing substrings, but only one is ever set after a |
| 3449 |
match. The convenience function for extracting the data by name |
match. The convenience function for extracting the data by name |
| 3450 |
returns the substring for the first, and in this example, the only, |
returns the substring for the first (and in this example, the only) |
| 3451 |
subpattern of that name that matched. This saves searching to find |
subpattern of that name that matched. This saves searching to find |
| 3452 |
which numbered subpattern it was. If you make a reference to a non- |
which numbered subpattern it was. If you make a reference to a non- |
| 3453 |
unique named subpattern from elsewhere in the pattern, the one that |
unique named subpattern from elsewhere in the pattern, the one that |
| 3462 |
following items: |
following items: |
| 3463 |
|
|
| 3464 |
a literal data character |
a literal data character |
| 3465 |
the . metacharacter |
the dot metacharacter |
| 3466 |
the \C escape sequence |
the \C escape sequence |
| 3467 |
the \X escape sequence (in UTF-8 mode with Unicode properties) |
the \X escape sequence (in UTF-8 mode with Unicode properties) |
| 3468 |
|
the \R escape sequence |
| 3469 |
an escape such as \d that matches a single character |
an escape such as \d that matches a single character |
| 3470 |
a character class |
a character class |
| 3471 |
a back reference (see next section) |
a back reference (see next section) |
| 3505 |
The quantifier {0} is permitted, causing the expression to behave as if |
The quantifier {0} is permitted, causing the expression to behave as if |
| 3506 |
the previous item and the quantifier were not present. |
the previous item and the quantifier were not present. |
| 3507 |
|
|
| 3508 |
For convenience (and historical compatibility) the three most common |
For convenience, the three most common quantifiers have single-charac- |
| 3509 |
quantifiers have single-character abbreviations: |
ter abbreviations: |
| 3510 |
|
|
| 3511 |
* is equivalent to {0,} |
* is equivalent to {0,} |
| 3512 |
+ is equivalent to {1,} |
+ is equivalent to {1,} |
| 3558 |
which matches one digit by preference, but can match two if that is the |
which matches one digit by preference, but can match two if that is the |
| 3559 |
only way the rest of the pattern matches. |
only way the rest of the pattern matches. |
| 3560 |
|
|
| 3561 |
If the PCRE_UNGREEDY option is set (an option which is not available in |
If the PCRE_UNGREEDY option is set (an option that is not available in |
| 3562 |
Perl), the quantifiers are not greedy by default, but individual ones |
Perl), the quantifiers are not greedy by default, but individual ones |
| 3563 |
can be made greedy by following them with a question mark. In other |
can be made greedy by following them with a question mark. In other |
| 3564 |
words, it inverts the default behaviour. |
words, it inverts the default behaviour. |
| 3569 |
minimum or maximum. |
minimum or maximum. |
| 3570 |
|
|
| 3571 |
If a pattern starts with .* or .{0,} and the PCRE_DOTALL option (equiv- |
If a pattern starts with .* or .{0,} and the PCRE_DOTALL option (equiv- |
| 3572 |
alent to Perl's /s) is set, thus allowing the . to match newlines, the |
alent to Perl's /s) is set, thus allowing the dot to match newlines, |
| 3573 |
pattern is implicitly anchored, because whatever follows will be tried |
the pattern is implicitly anchored, because whatever follows will be |
| 3574 |
against every character position in the subject string, so there is no |
tried against every character position in the subject string, so there |
| 3575 |
point in retrying the overall match at any position after the first. |
is no point in retrying the overall match at any position after the |
| 3576 |
PCRE normally treats such a pattern as though it were preceded by \A. |
first. PCRE normally treats such a pattern as though it were preceded |
| 3577 |
|
by \A. |
| 3578 |
|
|
| 3579 |
In cases where it is known that the subject string contains no new- |
In cases where it is known that the subject string contains no new- |
| 3580 |
lines, it is worth setting PCRE_DOTALL in order to obtain this opti- |
lines, it is worth setting PCRE_DOTALL in order to obtain this opti- |
| 3581 |
mization, or alternatively using ^ to indicate anchoring explicitly. |
mization, or alternatively using ^ to indicate anchoring explicitly. |
| 3582 |
|
|
| 3583 |
However, there is one situation where the optimization cannot be used. |
However, there is one situation where the optimization cannot be used. |
| 3584 |
When .* is inside capturing parentheses that are the subject of a |
When .* is inside capturing parentheses that are the subject of a |
| 3585 |
backreference elsewhere in the pattern, a match at the start may fail, |
backreference elsewhere in the pattern, a match at the start may fail |
| 3586 |
and a later one succeed. Consider, for example: |
where a later one succeeds. Consider, for example: |
| 3587 |
|
|
| 3588 |
(.*)abc\1 |
(.*)abc\1 |
| 3589 |
|
|
| 3590 |
If the subject is "xyz123abc123" the match point is the fourth charac- |
If the subject is "xyz123abc123" the match point is the fourth charac- |
| 3591 |
ter. For this reason, such a pattern is not implicitly anchored. |
ter. For this reason, such a pattern is not implicitly anchored. |
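
The match point in this example can be confirmed with any backtracking engine that supports back references. A minimal check using Python's re module (offsets are counted from zero, so the fourth character is offset 3):

```python
import re

m = re.search(r"(.*)abc\1", "xyz123abc123")

# The attempt at offset 0 fails; the first success starts at "1".
print(m.start(), m.group(0), m.group(1))  # 3 123abc123 123
```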
| 3592 |
|
|
| 3593 |
When a capturing subpattern is repeated, the value captured is the sub- |
When a capturing subpattern is repeated, the value captured is the sub- |
| 3596 |
(tweedle[dume]{3}\s*)+ |
(tweedle[dume]{3}\s*)+ |
| 3597 |
|
|
| 3598 |
has matched "tweedledum tweedledee" the value of the captured substring |
has matched "tweedledum tweedledee" the value of the captured substring |
| 3599 |
is "tweedledee". However, if there are nested capturing subpatterns, |
is "tweedledee". However, if there are nested capturing subpatterns, |
| 3600 |
the corresponding captured values may have been set in previous itera- |
the corresponding captured values may have been set in previous itera- |
| 3601 |
tions. For example, after |
tions. For example, after |
| 3602 |
|
|
| 3603 |
/(a|(b))+/ |
/(a|(b))+/ |
| 3607 |
|
|
| 3608 |
ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS |
ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS |
| 3609 |
|
|
| 3610 |
With both maximizing and minimizing repetition, failure of what follows |
With both maximizing ("greedy") and minimizing ("ungreedy" or "lazy") |
| 3611 |
normally causes the repeated item to be re-evaluated to see if a dif- |
repetition, failure of what follows normally causes the repeated item |
| 3612 |
ferent number of repeats allows the rest of the pattern to match. Some- |
to be re-evaluated to see if a different number of repeats allows the |
| 3613 |
times it is useful to prevent this, either to change the nature of the |
rest of the pattern to match. Sometimes it is useful to prevent this, |
| 3614 |
match, or to cause it to fail earlier than it otherwise might, when the |
either to change the nature of the match, or to cause it to fail earlier |
| 3615 |
author of the pattern knows there is no point in carrying on. |
than it otherwise might, when the author of the pattern knows there is |
| 3616 |
|
no point in carrying on. |
| 3617 |
|
|
| 3618 |
Consider, for example, the pattern \d+foo when applied to the subject |
Consider, for example, the pattern \d+foo when applied to the subject |
| 3619 |
line |
line |
| 3627 |
the means for specifying that once a subpattern has matched, it is not |
the means for specifying that once a subpattern has matched, it is not |
| 3628 |
to be re-evaluated in this way. |
to be re-evaluated in this way. |
| 3629 |
|
|
| 3630 |
If we use atomic grouping for the previous example, the matcher would |
If we use atomic grouping for the previous example, the matcher gives |
| 3631 |
give up immediately on failing to match "foo" the first time. The nota- |
up immediately on failing to match "foo" the first time. The notation |
| 3632 |
tion is a kind of special parenthesis, starting with (?> as in this |
is a kind of special parenthesis, starting with (?> as in this example: |
|
example: |
|
| 3633 |
|
|
| 3634 |
(?>\d+)foo |
(?>\d+)foo |
| 3635 |
|
|
| 3661 |
Possessive quantifiers are always greedy; the setting of the |
Possessive quantifiers are always greedy; the setting of the |
| 3662 |
PCRE_UNGREEDY option is ignored. They are a convenient notation for the |
PCRE_UNGREEDY option is ignored. They are a convenient notation for the |
| 3663 |
simpler forms of atomic group. However, there is no difference in the |
simpler forms of atomic group. However, there is no difference in the |
| 3664 |
meaning or processing of a possessive quantifier and the equivalent |
meaning of a possessive quantifier and the equivalent atomic group, |
| 3665 |
atomic group. |
though there may be a performance difference; possessive quantifiers |
| 3666 |
|
should be slightly faster. |
| 3667 |
The possessive quantifier syntax is an extension to the Perl syntax. |
|
| 3668 |
Jeffrey Friedl originated the idea (and the name) in the first edition |
The possessive quantifier syntax is an extension to the Perl 5.8 syn- |
| 3669 |
of his book. Mike McCloskey liked it, so implemented it when he built |
tax. Jeffrey Friedl originated the idea (and the name) in the first |
| 3670 |
Sun's Java package, and PCRE copied it from there. |
edition of his book. Mike McCloskey liked it, so implemented it when he |
| 3671 |
|
built Sun's Java package, and PCRE copied it from there. It ultimately |
| 3672 |
When a pattern contains an unlimited repeat inside a subpattern that |
found its way into Perl at release 5.10. |
| 3673 |
can itself be repeated an unlimited number of times, the use of an |
|
| 3674 |
atomic group is the only way to avoid some failing matches taking a |
PCRE has an optimization that automatically "possessifies" certain sim- |
| 3675 |
|
ple pattern constructs. For example, the sequence A+B is treated as |
| 3676 |
|
A++B because there is no point in backtracking into a sequence of A's |
| 3677 |
|
when B must follow. |
| 3678 |
|
|
| 3679 |
|
When a pattern contains an unlimited repeat inside a subpattern that |
| 3680 |
|
can itself be repeated an unlimited number of times, the use of an |
| 3681 |
|
atomic group is the only way to avoid some failing matches taking a |
| 3682 |
very long time indeed. The pattern |
very long time indeed. The pattern |
| 3683 |
|
|
| 3684 |
(\D+|<\d+>)*[!?] |
(\D+|<\d+>)*[!?] |
| 3685 |
|
|
| 3686 |
matches an unlimited number of substrings that either consist of non- |
matches an unlimited number of substrings that either consist of non- |
| 3687 |
digits, or digits enclosed in <>, followed by either ! or ?. When it |
digits, or digits enclosed in <>, followed by either ! or ?. When it |
| 3688 |
matches, it runs quickly. However, if it is applied to |
matches, it runs quickly. However, if it is applied to |
| 3689 |
|
|
| 3690 |
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa |
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa |
| 3691 |
|
|
| 3692 |
it takes a long time before reporting failure. This is because the |
it takes a long time before reporting failure. This is because the |
| 3693 |
string can be divided between the internal \D+ repeat and the external |
string can be divided between the internal \D+ repeat and the external |
| 3694 |
* repeat in a large number of ways, and all have to be tried. (The |
* repeat in a large number of ways, and all have to be tried. (The |
| 3695 |
example uses [!?] rather than a single character at the end, because |
example uses [!?] rather than a single character at the end, because |
| 3696 |
both PCRE and Perl have an optimization that allows for fast failure |
both PCRE and Perl have an optimization that allows for fast failure |
| 3697 |
when a single character is used. They remember the last single charac- |
when a single character is used. They remember the last single charac- |
| 3698 |
ter that is required for a match, and fail early if it is not present |
ter that is required for a match, and fail early if it is not present |
| 3699 |
in the string.) If the pattern is changed so that it uses an atomic |
in the string.) If the pattern is changed so that it uses an atomic |
| 3700 |
group, like this: |
group, like this: |
| 3701 |
|
|
| 3702 |
((?>\D+)|<\d+>)*[!?] |
((?>\D+)|<\d+>)*[!?] |
| 3703 |
|
|
| 3704 |
sequences of non-digits cannot be broken, and failure happens quickly. |
sequences of non-digits cannot be broken, and failure happens quickly. |
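
The quick failure can be reproduced in engines that lack atomic groups. The sketch below is an emulation, not PCRE's own mechanism: it replaces (?>\D+) with a lookahead that captures the run of non-digits, followed by a back reference that consumes exactly that run. It relies on the assumption that a lookahead, once matched, is not re-entered on backtracking, which holds for Python's re module. Python 3.11 and later also accept (?>...) directly. The name ATOMIC_STYLE is illustrative.

```python
import re

# Lookahead-plus-backreference emulation of the atomic group (?>\D+):
# the lookahead captures the longest run of non-digits, and \1 then
# consumes exactly that run, leaving nothing to backtrack into.
ATOMIC_STYLE = re.compile(r"(?:(?=(\D+))\1|<\d+>)*[!?]")

# The failing subject from the text: 52 letters and no "!" or "?".
subject = "a" * 52
failed = ATOMIC_STYLE.search(subject)

# A subject that does match: digits in angle brackets, then "!".
ok = ATOMIC_STYLE.fullmatch("<12>!")

print(failed, ok is not None)
```

The unprotected pattern (\D+|<\d+>)*[!?] is deliberately not run here, since on the same subject it can take a very long time to fail.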
| 3705 |
|
|
| 3706 |
|
|
| 3707 |
BACK REFERENCES |
BACK REFERENCES |
| 3708 |
|
|
| 3709 |
Outside a character class, a backslash followed by a digit greater than |
Outside a character class, a backslash followed by a digit greater than |
| 3710 |
0 (and possibly further digits) is a back reference to a capturing sub- |
0 (and possibly further digits) is a back reference to a capturing sub- |
| 3711 |
pattern earlier (that is, to its left) in the pattern, provided there |
pattern earlier (that is, to its left) in the pattern, provided there |
| 3712 |
have been that many previous capturing left parentheses. |
have been that many previous capturing left parentheses. |
| 3713 |
|
|
| 3714 |
However, if the decimal number following the backslash is less than 10, |
However, if the decimal number following the backslash is less than 10, |
| 3715 |
it is always taken as a back reference, and causes an error only if |
it is always taken as a back reference, and causes an error only if |
| 3716 |
there are not that many capturing left parentheses in the entire pat- |
there are not that many capturing left parentheses in the entire pat- |
| 3717 |
tern. In other words, the parentheses that are referenced need not be |
tern. In other words, the parentheses that are referenced need not be |
| 3718 |
to the left of the reference for numbers less than 10. A "forward back |
to the left of the reference for numbers less than 10. A "forward back |
| 3719 |
reference" of this type can make sense when a repetition is involved |
reference" of this type can make sense when a repetition is involved |
| 3720 |
and the subpattern to the right has participated in an earlier itera- |
and the subpattern to the right has participated in an earlier itera- |
| 3721 |
tion. |
tion. |
| 3722 |
|
|
| 3723 |
It is not possible to have a numerical "forward back reference" to sub- |
It is not possible to have a numerical "forward back reference" to a |
| 3724 |
pattern whose number is 10 or more. However, a back reference to any |
subpattern whose number is 10 or more using this syntax because a |
| 3725 |
subpattern is possible using named parentheses (see below). See also |
sequence such as \50 is interpreted as a character defined in octal. |
| 3726 |
the subsection entitled "Non-printing characters" above for further |
See the subsection entitled "Non-printing characters" above for further |
| 3727 |
details of the handling of digits following a backslash. |
details of the handling of digits following a backslash. There is no |
| 3728 |
|
such problem when named parentheses are used. A back reference to any |
| 3729 |
|
subpattern is possible using named parentheses (see below). |
| 3730 |
|
|
| 3731 |
|
Another way of avoiding the ambiguity inherent in the use of digits |
| 3732 |
|
following a backslash is to use the \g escape sequence, which is a fea- |
| 3733 |
|
ture introduced in Perl 5.10. This escape must be followed by a posi- |
| 3734 |
|
tive or a negative number, optionally enclosed in braces. These exam- |
| 3735 |
|
ples are all identical: |
| 3736 |
|
|
| 3737 |
|
(ring), \1 |
| 3738 |
|
(ring), \g1 |
| 3739 |
|
(ring), \g{1} |
| 3740 |
|
|
| 3741 |
|
A positive number specifies an absolute reference without the ambiguity |
| 3742 |
|
that is present in the older syntax. It is also useful when literal |
| 3743 |
|
digits follow the reference. A negative number is a relative reference. |
| 3744 |
|
Consider this example: |
| 3745 |
|
|
| 3746 |
|
(abc(def)ghi)\g{-1} |
| 3747 |
|
|
| 3748 |
|
The sequence \g{-1} is a reference to the most recently started captur- |
| 3749 |
|
ing subpattern before \g, that is, it is equivalent to \2. Similarly, |
| 3750 |
|
\g{-2} would be equivalent to \1. The use of relative references can be |
| 3751 |
|
helpful in long patterns, and also in patterns that are created by |
| 3752 |
|
joining together fragments that contain references within themselves. |
| 3753 |
|
|
| 3754 |
A back reference matches whatever actually matched the capturing sub- |
A back reference matches whatever actually matched the capturing sub- |
| 3755 |
pattern in the current subject string, rather than anything matching |
pattern in the current subject string, rather than anything matching |
| 3768 |
matches "rah rah" and "RAH RAH", but not "RAH rah", even though the |
matches "rah rah" and "RAH RAH", but not "RAH rah", even though the |
| 3769 |
original capturing subpattern is matched caselessly. |
original capturing subpattern is matched caselessly. |
| 3770 |
|
|
| 3771 |
Back references to named subpatterns use the Python syntax (?P=name). |
Back references to named subpatterns use the Perl syntax \k<name> or |
| 3772 |
We could rewrite the above example as follows: |
\k'name' or the Python syntax (?P=name). We could rewrite the above |
| 3773 |
|
example in either of the following ways: |
| 3774 |
|
|
| 3775 |
|
(?<p1>(?i)rah)\s+\k<p1> |
| 3776 |
(?P<p1>(?i)rah)\s+(?P=p1) |
(?P<p1>(?i)rah)\s+(?P=p1) |
| 3777 |
|
|
| 3778 |
A subpattern that is referenced by name may appear in the pattern |
A subpattern that is referenced by name may appear in the pattern |
| 3779 |
before or after the reference. |
before or after the reference. |
| 3780 |
|
|
| 3781 |
There may be more than one back reference to the same subpattern. If a |
There may be more than one back reference to the same subpattern. If a |
| 3782 |
subpattern has not actually been used in a particular match, any back |
subpattern has not actually been used in a particular match, any back |
| 3783 |
references to it always fail. For example, the pattern |
references to it always fail. For example, the pattern |
| 3784 |
|
|
| 3785 |
(a|(bc))\2 |
(a|(bc))\2 |
| 3786 |
|
|
| 3787 |
always fails if it starts to match "a" rather than "bc". Because there |
always fails if it starts to match "a" rather than "bc". Because there |
| 3788 |
may be many capturing parentheses in a pattern, all digits following |
may be many capturing parentheses in a pattern, all digits following |
| 3789 |
the backslash are taken as part of a potential back reference number. |
the backslash are taken as part of a potential back reference number. |
| 3790 |
If the pattern continues with a digit character, some delimiter must be |
If the pattern continues with a digit character, some delimiter must be |
| 3791 |
used to terminate the back reference. If the PCRE_EXTENDED option is |
used to terminate the back reference. If the PCRE_EXTENDED option is |
| 3792 |
set, this can be whitespace. Otherwise an empty comment (see "Com- |
set, this can be whitespace. Otherwise an empty comment (see "Com- |
| 3793 |
ments" below) can be used. |
ments" below) can be used. |
| 3794 |
|
|
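The rule that a back reference to an unused subpattern always fails can be checked with a short sketch. It uses Python's re module rather than PCRE itself (an assumption made so that the example is self-contained); the two engines are expected to agree on this particular pattern. The variable names are illustrative only.

```python
import re

# (a|(bc))\2 : group 2 is set only when the "bc" alternative is taken.
pattern = re.compile(r"(a|(bc))\2")

# Group 2 captured "bc", so \2 can match the second "bc".
took_bc = pattern.fullmatch("bcbc")

# The "a" alternative leaves group 2 unset, so \2 fails and there is no match.
took_a = pattern.fullmatch("abc")

print(took_bc is not None, took_a is None)
```
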
| 3795 |
A back reference that occurs inside the parentheses to which it refers |
A back reference that occurs inside the parentheses to which it refers |
| 3796 |
fails when the subpattern is first used, so, for example, (a\1) never |
fails when the subpattern is first used, so, for example, (a\1) never |
| 3797 |
matches. However, such references can be useful inside repeated sub- |
matches. However, such references can be useful inside repeated sub- |
| 3798 |
patterns. For example, the pattern |
patterns. For example, the pattern |
| 3799 |
|
|
| 3800 |
(a|b\1)+ |
(a|b\1)+ |
| 3801 |
|
|
| 3802 |
matches any number of "a"s and also "aba", "ababbaa" etc. At each iter- |
matches any number of "a"s and also "aba", "ababbaa" etc. At each iter- |
| 3803 |
ation of the subpattern, the back reference matches the character |
ation of the subpattern, the back reference matches the character |
| 3804 |
string corresponding to the previous iteration. In order for this to |
string corresponding to the previous iteration. In order for this to |
| 3805 |
work, the pattern must be such that the first iteration does not need |
work, the pattern must be such that the first iteration does not need |
| 3806 |
to match the back reference. This can be done using alternation, as in |
to match the back reference. This can be done using alternation, as in |
| 3807 |
the example above, or by a quantifier with a minimum of zero. |
the example above, or by a quantifier with a minimum of zero. |
| 3808 |
|
|
| 3809 |
|
|
| 3810 |
ASSERTIONS |
ASSERTIONS |
| 3811 |
|
|
| 3812 |
An assertion is a test on the characters following or preceding the |
An assertion is a test on the characters following or preceding the |
| 3813 |
current matching point that does not actually consume any characters. |
current matching point that does not actually consume any characters. |
| 3814 |
The simple assertions coded as \b, \B, \A, \G, \Z, \z, ^ and $ are |
The simple assertions coded as \b, \B, \A, \G, \Z, \z, ^ and $ are |
| 3815 |
described above. |
described above. |
| 3816 |
|
|
| 3817 |
More complicated assertions are coded as subpatterns. There are two |
More complicated assertions are coded as subpatterns. There are two |
| 3818 |
kinds: those that look ahead of the current position in the subject |
kinds: those that look ahead of the current position in the subject |
| 3819 |
string, and those that look behind it. An assertion subpattern is |
string, and those that look behind it. An assertion subpattern is |
| 3820 |
matched in the normal way, except that it does not cause the current |
matched in the normal way, except that it does not cause the current |
| 3821 |
matching position to be changed. |
matching position to be changed. |
| 3822 |
|
|
| 3823 |
Assertion subpatterns are not capturing subpatterns, and may not be |
Assertion subpatterns are not capturing subpatterns, and may not be |
| 3824 |
repeated, because it makes no sense to assert the same thing several |
repeated, because it makes no sense to assert the same thing several |
| 3825 |
times. If any kind of assertion contains capturing subpatterns within |
times. If any kind of assertion contains capturing subpatterns within |
| 3826 |
it, these are counted for the purposes of numbering the capturing sub- |
it, these are counted for the purposes of numbering the capturing sub- |
| 3827 |
patterns in the whole pattern. However, substring capturing is carried |
patterns in the whole pattern. However, substring capturing is carried |
| 3828 |
out only for positive assertions, because it does not make sense for |
out only for positive assertions, because it does not make sense for |
| 3829 |
negative assertions. |
negative assertions. |
| 3830 |
|
|
| 3831 |
Lookahead assertions |
Lookahead assertions |
| 3835 |
|
|
| 3836 |
\w+(?=;) |
\w+(?=;) |
| 3837 |
|
|
| 3838 |
matches a word followed by a semicolon, but does not include the semi- |
matches a word followed by a semicolon, but does not include the semi- |
| 3839 |
colon in the match, and |
colon in the match, and |
| 3840 |
|
|
| 3841 |
foo(?!bar) |
foo(?!bar) |
| 3842 |
|
|
| 3843 |
matches any occurrence of "foo" that is not followed by "bar". Note |
matches any occurrence of "foo" that is not followed by "bar". Note |
| 3844 |
that the apparently similar pattern |
that the apparently similar pattern |
| 3845 |
|
|
| 3846 |
(?!foo)bar |
(?!foo)bar |
| 3847 |
|
|
| 3848 |
does not find an occurrence of "bar" that is preceded by something |
does not find an occurrence of "bar" that is preceded by something |
| 3849 |
other than "foo"; it finds any occurrence of "bar" whatsoever, because |
other than "foo"; it finds any occurrence of "bar" whatsoever, because |
| 3850 |
the assertion (?!foo) is always true when the next three characters are |
the assertion (?!foo) is always true when the next three characters are |
| 3851 |
"bar". A lookbehind assertion is needed to achieve the other effect. |
"bar". A lookbehind assertion is needed to achieve the other effect. |
| 3852 |
|
|
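The three lookahead patterns above can be tried out as follows. This is a minimal sketch using Python's re module as a stand-in for PCRE (an assumption; lookahead syntax is the same in both). The subject strings are invented for illustration.

```python
import re

# \w+(?=;) matches a word followed by a semicolon, excluding the semicolon.
word = re.search(r"\w+(?=;)", "total = count;")

# foo(?!bar) skips the "foo" in "foobar" and finds the one in "foobaz".
foo_ok = re.search(r"foo(?!bar)", "foobar foobaz")

# (?!foo)bar still finds "bar" inside "foobar": the assertion is tested
# at the position where "bar" starts, and the next characters are not "foo".
bar_any = re.search(r"(?!foo)bar", "foobar")

print(word.group(), foo_ok.start(), bar_any.start())
```
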
| 3853 |
If you want to force a matching failure at some point in a pattern, the |
If you want to force a matching failure at some point in a pattern, the |
| 3854 |
most convenient way to do it is with (?!) because an empty string |
most convenient way to do it is with (?!) because an empty string |
| 3855 |
always matches, so an assertion that requires there not to be an empty |
always matches, so an assertion that requires there not to be an empty |
| 3856 |
string must always fail. |
string must always fail. |
| 3857 |
|
|
| 3858 |
Lookbehind assertions |
Lookbehind assertions |
| 3859 |
|
|
| 3860 |
Lookbehind assertions start with (?<= for positive assertions and (?<! |
Lookbehind assertions start with (?<= for positive assertions and (?<! |
| 3861 |
for negative assertions. For example, |
for negative assertions. For example, |
| 3862 |
|
|
| 3863 |
(?<!foo)bar |
(?<!foo)bar |
| 3864 |
|
|
| 3865 |
does find an occurrence of "bar" that is not preceded by "foo". The |
does find an occurrence of "bar" that is not preceded by "foo". The |
| 3866 |
contents of a lookbehind assertion are restricted such that all the |
contents of a lookbehind assertion are restricted such that all the |
| 3867 |
strings it matches must have a fixed length. However, if there are sev- |
strings it matches must have a fixed length. However, if there are sev- |
| 3868 |
eral top-level alternatives, they do not all have to have the same |
eral top-level alternatives, they do not all have to have the same |
| 3869 |
fixed length. Thus |
fixed length. Thus |
| 3870 |
|
|
| 3871 |
(?<=bullock|donkey) |
(?<=bullock|donkey) |
| 3874 |
|
|
| 3875 |
(?<!dogs?|cats?) |
(?<!dogs?|cats?) |
| 3876 |
|
|
| 3877 |
causes an error at compile time. Branches that match different length |
causes an error at compile time. Branches that match different length |
| 3878 |
strings are permitted only at the top level of a lookbehind assertion. |
strings are permitted only at the top level of a lookbehind assertion. |
| 3879 |
This is an extension compared with Perl (at least for 5.8), which |
This is an extension compared with Perl (at least for 5.8), which |
| 3880 |
requires all branches to match the same length of string. An assertion |
requires all branches to match the same length of string. An assertion |
| 3881 |
such as |
such as |
| 3882 |
|
|
| 3883 |
(?<=ab(c|de)) |
(?<=ab(c|de)) |
| 3884 |
|
|
| 3885 |
is not permitted, because its single top-level branch can match two |
is not permitted, because its single top-level branch can match two |
| 3886 |
different lengths, but it is acceptable if rewritten to use two top- |
different lengths, but it is acceptable if rewritten to use two top- |
| 3887 |
level branches: |
level branches: |
| 3888 |
|
|
| 3889 |
(?<=abc|abde) |
(?<=abc|abde) |
| 3890 |
|
|
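As a rough illustration, the sketch below uses Python's re module, which is an assumption and not PCRE. Python is stricter than PCRE here: it rejects any lookbehind whose alternatives differ in length, so (?<=abc|abde) is written as two separate fixed-length lookbehinds. The trailing "X" is an invented character added so that the assertion has something to precede.

```python
import re

# (?<!foo)bar finds "bar" only when it is not preceded by "foo".
not_after_foo = re.compile(r"(?<!foo)bar")

# Stand-in for (?<=abc|abde)X: each lookbehind has a single fixed length.
after_abc_or_abde = re.compile(r"(?:(?<=abc)|(?<=abde))X")

print(not_after_foo.search("foobar"))          # no match
print(not_after_foo.search("bazbar").start())  # position of "bar"
print(after_abc_or_abde.search("abdeX") is not None)
```
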
| 3891 |
The implementation of lookbehind assertions is, for each alternative, |
The implementation of lookbehind assertions is, for each alternative, |
| 3892 |
to temporarily move the current position back by the fixed width and |
to temporarily move the current position back by the fixed length and |
| 3893 |
then try to match. If there are insufficient characters before the cur- |
then try to match. If there are insufficient characters before the cur- |
| 3894 |
rent position, the match is deemed to fail. |
rent position, the assertion fails. |
| 3895 |
|
|
| 3896 |
PCRE does not allow the \C escape (which matches a single byte in UTF-8 |
PCRE does not allow the \C escape (which matches a single byte in UTF-8 |
| 3897 |
mode) to appear in lookbehind assertions, because it makes it impossi- |
mode) to appear in lookbehind assertions, because it makes it impossi- |
| 3898 |
ble to calculate the length of the lookbehind. The \X escape, which can |
ble to calculate the length of the lookbehind. The \X and \R escapes, |
| 3899 |
match different numbers of bytes, is also not permitted. |
which can match different numbers of bytes, are also not permitted. |
| 3900 |
|
|
| 3901 |
Atomic groups can be used in conjunction with lookbehind assertions to |
Possessive quantifiers can be used in conjunction with lookbehind |
| 3902 |
specify efficient matching at the end of the subject string. Consider a |
assertions to specify efficient matching at the end of the subject |
| 3903 |
simple pattern such as |
string. Consider a simple pattern such as |
| 3904 |
|
|
| 3905 |
abcd$ |
abcd$ |
| 3906 |
|
|
| 3907 |
when applied to a long string that does not match. Because matching |
when applied to a long string that does not match. Because matching |
| 3908 |
proceeds from left to right, PCRE will look for each "a" in the subject |
proceeds from left to right, PCRE will look for each "a" in the subject |
| 3909 |
and then see if what follows matches the rest of the pattern. If the |
and then see if what follows matches the rest of the pattern. If the |
| 3910 |
pattern is specified as |
pattern is specified as |
| 3911 |
|
|
| 3912 |
^.*abcd$ |
^.*abcd$ |
| 3913 |
|
|
| 3914 |
the initial .* matches the entire string at first, but when this fails |
the initial .* matches the entire string at first, but when this fails |
| 3915 |
(because there is no following "a"), it backtracks to match all but the |
(because there is no following "a"), it backtracks to match all but the |
| 3916 |
last character, then all but the last two characters, and so on. Once |
last character, then all but the last two characters, and so on. Once |
| 3917 |
again the search for "a" covers the entire string, from right to left, |
again the search for "a" covers the entire string, from right to left, |
| 3918 |
so we are no better off. However, if the pattern is written as |
so we are no better off. However, if the pattern is written as |
| 3919 |
|
|
|
^(?>.*)(?<=abcd) |
|
|
|
|
|
or, equivalently, using the possessive quantifier syntax, |
|
|
|
|
| 3920 |
^.*+(?<=abcd) |
^.*+(?<=abcd) |
| 3921 |
|
|
| 3922 |
there can be no backtracking for the .* item; it can match only the |
there can be no backtracking for the .*+ item; it can match only the |
| 3923 |
entire string. The subsequent lookbehind assertion does a single test |
entire string. The subsequent lookbehind assertion does a single test |
| 3924 |
on the last four characters. If it fails, the match fails immediately. |
on the last four characters. If it fails, the match fails immediately. |
| 3925 |
For long strings, this approach makes a significant difference to the |
For long strings, this approach makes a significant difference to the |
| 3926 |
processing time. |
processing time. |
| 3927 |
|
|
| 3928 |
Using multiple assertions |
Using multiple assertions |
| 3931 |
|
|
| 3932 |
(?<=\d{3})(?<!999)foo |
(?<=\d{3})(?<!999)foo |
| 3933 |
|
|
| 3934 |
matches "foo" preceded by three digits that are not "999". Notice that |
matches "foo" preceded by three digits that are not "999". Notice that |
| 3935 |
each of the assertions is applied independently at the same point in |
each of the assertions is applied independently at the same point in |
| 3936 |
the subject string. First there is a check that the previous three |
the subject string. First there is a check that the previous three |
| 3937 |
characters are all digits, and then there is a check that the same |
characters are all digits, and then there is a check that the same |
| 3938 |
three characters are not "999". This pattern does not match "foo" pre- |
three characters are not "999". This pattern does not match "foo" pre- |
| 3939 |
ceded by six characters, the first three of which are digits and the last |
ceded by six characters, the first three of which are digits and the last |
| 3940 |
three of which are not "999". For example, it doesn't match "123abc- |
three of which are not "999". For example, it doesn't match "123abc- |
| 3941 |
foo". A pattern to do that is |
foo". A pattern to do that is |
| 3942 |
|
|
| 3943 |
(?<=\d{3}...)(?<!999)foo |
(?<=\d{3}...)(?<!999)foo |
| 3944 |
|
|
| 3945 |
This time the first assertion looks at the preceding six characters, |
This time the first assertion looks at the preceding six characters, |
| 3946 |
checking that the first three are digits, and then the second assertion |
checking that the first three are digits, and then the second assertion |
| 3947 |
checks that the preceding three characters are not "999". |
checks that the preceding three characters are not "999". |
| 3948 |
|
|
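The difference between the two patterns can be seen in a small sketch. It uses Python's re module in place of PCRE (an assumption; both lookbehinds have a fixed length, so the behaviour should be the same). The names last_three and six_back are illustrative.

```python
import re

# Both assertions inspect the same three characters before "foo".
last_three = re.compile(r"(?<=\d{3})(?<!999)foo")

# The first assertion looks back six characters, the second only three.
six_back = re.compile(r"(?<=\d{3}...)(?<!999)foo")

for subject in ("123foo", "999foo", "123abcfoo", "123999foo"):
    print(subject,
          last_three.search(subject) is not None,
          six_back.search(subject) is not None)
```
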
| 3950 |
|
|
| 3951 |
(?<=(?<!foo)bar)baz |
(?<=(?<!foo)bar)baz |
| 3952 |
|
|
| 3953 |
matches an occurrence of "baz" that is preceded by "bar" which in turn |
matches an occurrence of "baz" that is preceded by "bar" which in turn |
| 3954 |
is not preceded by "foo", while |
is not preceded by "foo", while |
| 3955 |
|
|
| 3956 |
(?<=\d{3}(?!999)...)foo |
(?<=\d{3}(?!999)...)foo |
| 3957 |
|
|
| 3958 |
is another pattern that matches "foo" preceded by three digits and any |
is another pattern that matches "foo" preceded by three digits and any |
| 3959 |
three characters that are not "999". |
three characters that are not "999". |
| 3960 |
|
|
| 3961 |
|
|
| 3962 |
CONDITIONAL SUBPATTERNS |
CONDITIONAL SUBPATTERNS |
| 3963 |
|
|
| 3964 |
It is possible to cause the matching process to obey a subpattern con- |
It is possible to cause the matching process to obey a subpattern con- |
| 3965 |
ditionally or to choose between two alternative subpatterns, depending |
ditionally or to choose between two alternative subpatterns, depending |
| 3966 |
on the result of an assertion, or whether a previous capturing subpat- |
on the result of an assertion, or whether a previous capturing subpat- |
| 3967 |
tern matched or not. The two possible forms of conditional subpattern |
tern matched or not. The two possible forms of conditional subpattern |
| 3968 |
are |
are |
| 3969 |
|
|
| 3970 |
(?(condition)yes-pattern) |
(?(condition)yes-pattern) |
| 3971 |
(?(condition)yes-pattern|no-pattern) |
(?(condition)yes-pattern|no-pattern) |
| 3972 |
|
|
| 3973 |
If the condition is satisfied, the yes-pattern is used; otherwise the |
If the condition is satisfied, the yes-pattern is used; otherwise the |
| 3974 |
no-pattern (if present) is used. If there are more than two alterna- |
no-pattern (if present) is used. If there are more than two alterna- |
| 3975 |
tives in the subpattern, a compile-time error occurs. |
tives in the subpattern, a compile-time error occurs. |
| 3976 |
|
|
| 3977 |
There are three kinds of condition. If the text between the parentheses |
There are four kinds of condition: references to subpatterns, refer- |
| 3978 |
consists of a sequence of digits, or a sequence of alphanumeric charac- |
ences to recursion, a pseudo-condition called DEFINE, and assertions. |
| 3979 |
ters and underscores, the condition is satisfied if the capturing sub- |
|
| 3980 |
pattern of that number or name has previously matched. There is a pos- |
Checking for a used subpattern by number |
| 3981 |
sible ambiguity here, because subpattern names may consist entirely of |
|
| 3982 |
digits. PCRE looks first for a named subpattern; if it cannot find one |
If the text between the parentheses consists of a sequence of digits, |
| 3983 |
and the text consists entirely of digits, it looks for a subpattern of |
the condition is true if the capturing subpattern of that number has |
| 3984 |
that number, which must be greater than zero. Using subpattern names |
previously matched. |
|
that consist entirely of digits is not recommended. |
|
| 3985 |
|
|
| 3986 |
Consider the following pattern, which contains non-significant white |
Consider the following pattern, which contains non-significant white |
| 3987 |
space to make it more readable (assume the PCRE_EXTENDED option) and to |
space to make it more readable (assume the PCRE_EXTENDED option) and to |
| 3998 |
tern is executed and a closing parenthesis is required. Otherwise, |
tern is executed and a closing parenthesis is required. Otherwise, |
| 3999 |
since no-pattern is not present, the subpattern matches nothing. In |
since no-pattern is not present, the subpattern matches nothing. In |
| 4000 |
other words, this pattern matches a sequence of non-parentheses, |
other words, this pattern matches a sequence of non-parentheses, |
| 4001 |
optionally enclosed in parentheses. Rewriting it to use a named subpat- |
optionally enclosed in parentheses. |
| 4002 |
tern gives this: |
|
| 4003 |
|
Checking for a used subpattern by name |
| 4004 |
|
|
| 4005 |
|
Perl uses the syntax (?(<name>)...) or (?('name')...) to test for a |
| 4006 |
|
used subpattern by name. For compatibility with earlier versions of |
| 4007 |
|
PCRE, which had this facility before Perl, the syntax (?(name)...) is |
| 4008 |
|
also recognized. However, there is a possible ambiguity with this syn- |
| 4009 |
|
tax, because subpattern names may consist entirely of digits. PCRE |
| 4010 |
|
looks first for a named subpattern; if it cannot find one and the name |
| 4011 |
|
consists entirely of digits, PCRE looks for a subpattern of that num- |
| 4012 |
|
ber, which must be greater than zero. Using subpattern names that con- |
| 4013 |
|
sist entirely of digits is not recommended. |
| 4014 |
|
|
| 4015 |
|
Rewriting the above example to use a named subpattern gives this: |
| 4016 |
|
|
| 4017 |
(?P<OPEN> \( )? [^()]+ (?(OPEN) \) ) |
(?<OPEN> \( )? [^()]+ (?(<OPEN>) \) ) |
| 4018 |
|
|
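The named form of the condition can be exercised with the sketch below. It assumes Python's re module, which accepts the older (?P<OPEN>...) and (?(OPEN)...) spellings but not (?(<OPEN>)...); re.VERBOSE plays the role of PCRE_EXTENDED. The subject strings are invented.

```python
import re

# An optional "(" is captured as OPEN; a ")" is required only if OPEN matched.
paren = re.compile(r"(?P<OPEN> \( )? [^()]+ (?(OPEN) \) )", re.VERBOSE)

for subject in ("abc", "(abc)", "(abc", "abc)"):
    print(subject, paren.fullmatch(subject) is not None)
```

fullmatch is used so that a stray parenthesis at either end counts as a failure instead of being left outside the match.
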
| 4019 |
|
|
| 4020 |
|
Checking for pattern recursion |
| 4021 |
|
|
| 4022 |
If the condition is the string (R), and there is no subpattern with the |
If the condition is the string (R), and there is no subpattern with the |
| 4023 |
name R, the condition is satisfied if a recursive call to the pattern |
name R, the condition is true if a recursive call to the whole pattern |
| 4024 |
or subpattern has been made. At "top level", the condition is false. |
or any subpattern has been made. If digits or a name preceded by amper- |
| 4025 |
This is a PCRE extension. Recursive patterns are described in the next |
sand follow the letter R, for example: |
| 4026 |
section. |
|
| 4027 |
|
(?(R3)...) or (?(R&name)...) |
| 4028 |
|
|
| 4029 |
|
the condition is true if the most recent recursion is into the subpat- |
| 4030 |
|
tern whose number or name is given. This condition does not check the |
| 4031 |
|
entire recursion stack. |
| 4032 |
|
|
| 4033 |
|
At "top level", all these recursion test conditions are false. Recur- |
| 4034 |
|
sive patterns are described below. |
| 4035 |
|
|
| 4036 |
|
Defining subpatterns for use by reference only |
| 4037 |
|
|
| 4038 |
|
If the condition is the string (DEFINE), and there is no subpattern |
| 4039 |
|
with the name DEFINE, the condition is always false. In this case, |
| 4040 |
|
there may be only one alternative in the subpattern. It is always |
| 4041 |
|
skipped if control reaches this point in the pattern; the idea of |
| 4042 |
|
DEFINE is that it can be used to define "subroutines" that can be ref- |
| 4043 |
|
erenced from elsewhere. (The use of "subroutines" is described below.) |
| 4044 |
|
For example, a pattern to match an IPv4 address could be written like |
| 4045 |
|
this (ignore whitespace and line breaks): |
| 4046 |
|
|
| 4047 |
|
(?(DEFINE) (?<byte> 2[0-4]\d | 25[0-5] | 1\d\d | [1-9]?\d) ) |
| 4048 |
|
\b (?&byte) (\.(?&byte)){3} \b |
| 4049 |
|
|
| 4050 |
|
The first part of the pattern is a DEFINE group inside which another |
| 4051 |
|
group named "byte" is defined. This matches an individual component of |
| 4052 |
|
an IPv4 address (a number less than 256). When matching takes place, |
| 4053 |
|
this part of the pattern is skipped because DEFINE acts like a false |
| 4054 |
|
condition. |
| 4055 |
|
|
| 4056 |
|
The rest of the pattern uses references to the named group to match the |
| 4057 |
|
four dot-separated components of an IPv4 address, insisting on a word |
| 4058 |
|
boundary at each end. |
| 4059 |
|
|
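Python's re module has no DEFINE or (?&name), so the sketch below imitates the idea by building the pattern from a reusable string fragment. This is a swapped-in technique, not the PCRE mechanism; the names BYTE and IPV4 are hypothetical.

```python
import re

# The "subroutine": one component of an IPv4 address (a number below 256).
BYTE = r"(?:2[0-4]\d|25[0-5]|1\d\d|[1-9]?\d)"

# Four dot-separated components, with a word boundary at each end.
IPV4 = re.compile(rf"\b{BYTE}(?:\.{BYTE}){{3}}\b")

for subject in ("192.168.0.1", "10.0.0.255", "256.1.1.1", "1.2.3"):
    print(subject, IPV4.fullmatch(subject) is not None)
```

Unlike a DEFINE group, the fragment is expanded textually four times, so the compiled pattern is larger, but the set of strings it matches is intended to be the same.
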
| 4060 |
|
Assertion conditions |
| 4061 |
|
|
| 4062 |
If the condition is not a sequence of digits or (R), it must be an |
If the condition is not in any of the above formats, it must be an |
| 4063 |
assertion. This may be a positive or negative lookahead or lookbehind |
assertion. This may be a positive or negative lookahead or lookbehind |
| 4064 |
assertion. Consider this pattern, again containing non-significant |
assertion. Consider this pattern, again containing non-significant |
| 4065 |
white space, and with the two alternatives on the second line: |
white space, and with the two alternatives on the second line: |
| 4094 |
unlimited nested parentheses. Without the use of recursion, the best |
unlimited nested parentheses. Without the use of recursion, the best |
| 4095 |
that can be done is to use a pattern that matches up to some fixed |
that can be done is to use a pattern that matches up to some fixed |
| 4096 |
depth of nesting. It is not possible to handle an arbitrary nesting |
depth of nesting. It is not possible to handle an arbitrary nesting |
| 4097 |
depth. Perl provides a facility that allows regular expressions to |
depth. |
| 4098 |
recurse (amongst other things). It does this by interpolating Perl code |
|
| 4099 |
in the expression at run time, and the code can refer to the expression |
For some time, Perl has provided a facility that allows regular expres- |
| 4100 |
itself. A Perl pattern to solve the parentheses problem can be created |
sions to recurse (amongst other things). It does this by interpolating |
| 4101 |
like this: |
Perl code in the expression at run time, and the code can refer to the |
| 4102 |
|
expression itself. A Perl pattern using code interpolation to solve the |
| 4103 |
|
parentheses problem can be created like this: |
| 4104 |
|
|
| 4105 |
$re = qr{\( (?: (?>[^()]+) | (?p{$re}) )* \)}x; |
$re = qr{\( (?: (?>[^()]+) | (?p{$re}) )* \)}x; |
| 4106 |
|
|
| 4107 |
The (?p{...}) item interpolates Perl code at run time, and in this case |
The (?p{...}) item interpolates Perl code at run time, and in this case |
| 4108 |
refers recursively to the pattern in which it appears. Obviously, PCRE |
refers recursively to the pattern in which it appears. |
| 4109 |
cannot support the interpolation of Perl code. Instead, it supports |
|
| 4110 |
some special syntax for recursion of the entire pattern, and also for |
Obviously, PCRE cannot support the interpolation of Perl code. Instead, |
| 4111 |
individual subpattern recursion. |
it supports special syntax for recursion of the entire pattern, and |
| 4112 |
|
also for individual subpattern recursion. After its introduction in |
| 4113 |
|
PCRE and Python, this kind of recursion was introduced into Perl at |
| 4114 |
|
release 5.10. |
| 4115 |
|
|
| 4116 |
The special item that consists of (? followed by a number greater than |
A special item that consists of (? followed by a number greater than |
| 4117 |
zero and a closing parenthesis is a recursive call of the subpattern of |
zero and a closing parenthesis is a recursive call of the subpattern of |
| 4118 |
the given number, provided that it occurs inside that subpattern. (If |
the given number, provided that it occurs inside that subpattern. (If |
| 4119 |
not, it is a "subroutine" call, which is described in the next sec- |
not, it is a "subroutine" call, which is described in the next sec- |
| 4120 |
tion.) The special item (?R) is a recursive call of the entire regular |
tion.) The special item (?R) or (?0) is a recursive call of the entire |
| 4121 |
expression. |
regular expression. |
| 4122 |
|
|
| 4123 |
A recursive subpattern call is always treated as an atomic group. That |
In PCRE (like Python, but unlike Perl), a recursive subpattern call is |
| 4124 |
is, once it has matched some of the subject string, it is never re- |
always treated as an atomic group. That is, once it has matched some of |
| 4125 |
entered, even if it contains untried alternatives and there is a subse- |
the subject string, it is never re-entered, even if it contains untried |
| 4126 |
quent matching failure. |
alternatives and there is a subsequent matching failure. |
| 4127 |
|
|
| 4128 |
This PCRE pattern solves the nested parentheses problem (assume the |
This PCRE pattern solves the nested parentheses problem (assume the |
| 4129 |
PCRE_EXTENDED option is set so that white space is ignored): |
PCRE_EXTENDED option is set so that white space is ignored): |
| 4130 |
|
|
| 4131 |
\( ( (?>[^()]+) | (?R) )* \) |
\( ( (?>[^()]+) | (?R) )* \) |
| 4132 |
|
|
| 4133 |
First it matches an opening parenthesis. Then it matches any number of |
First it matches an opening parenthesis. Then it matches any number of |
| 4134 |
substrings which can either be a sequence of non-parentheses, or a |
substrings which can either be a sequence of non-parentheses, or a |
| 4135 |
recursive match of the pattern itself (that is, a correctly parenthe- |
recursive match of the pattern itself (that is, a correctly parenthe- |
| 4136 |
sized substring). Finally there is a closing parenthesis. |
sized substring). Finally there is a closing parenthesis. |
| 4137 |
|
|
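Python's re module does not support (?R), so the sketch below mirrors the structure of the pattern with a hand-written recursive function instead. It is an illustration of the matching logic under that assumption, not PCRE's implementation; match_group is a hypothetical helper.

```python
def match_group(s, i=0):
    """Return the index just past a balanced group starting at s[i], or None."""
    # \(  : the group must open with a parenthesis.
    if i >= len(s) or s[i] != "(":
        return None
    i += 1
    # ( [^()]+ | (?R) )*  : skip non-parentheses or recurse into a nested group.
    while i < len(s) and s[i] != ")":
        if s[i] == "(":
            i = match_group(s, i)
            if i is None:
                return None
        else:
            i += 1
    # \)  : the group must close before the subject ends.
    return i + 1 if i < len(s) else None


print(match_group("(ab(cd)ef)"))
print(match_group("(aaaa()"))
```

The function never revisits characters it has already consumed, which is comparable to the effect of the atomic group in the pattern.
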
| 4138 |
If this were part of a larger pattern, you would not want to recurse |
If this were part of a larger pattern, you would not want to recurse |
| 4139 |
the entire pattern, so instead you could use this: |
the entire pattern, so instead you could use this: |
| 4140 |
|
|
| 4141 |
( \( ( (?>[^()]+) | (?1) )* \) ) |
( \( ( (?>[^()]+) | (?1) )* \) ) |
| 4142 |
|
|
| 4143 |
We have put the pattern into parentheses, and caused the recursion to |
We have put the pattern into parentheses, and caused the recursion to |
| 4144 |
refer to them instead of the whole pattern. In a larger pattern, keep- |
refer to them instead of the whole pattern. In a larger pattern, keep- |
| 4145 |
ing track of parenthesis numbers can be tricky. It may be more conve- |
ing track of parenthesis numbers can be tricky. It may be more conve- |
| 4146 |
nient to use named parentheses instead. For this, PCRE uses (?P>name), |
nient to use named parentheses instead. The Perl syntax for this is |
| 4147 |
which is an extension to the Python syntax that PCRE uses for named |
(?&name); PCRE's earlier syntax (?P>name) is also supported. We could |
| 4148 |
parentheses (Perl does not provide named parentheses). We could rewrite |
rewrite the above example as follows: |
| 4149 |
the above example as follows: |
|
| 4150 |
|
(?<pn> \( ( (?>[^()]+) | (?&pn) )* \) ) |
| 4151 |
(?P<pn> \( ( (?>[^()]+) | (?P>pn) )* \) ) |
|
| 4152 |
|
If there is more than one subpattern with the same name, the earliest |
| 4153 |
This particular example pattern contains nested unlimited repeats, and |
one is used. This particular example pattern contains nested unlimited |
| 4154 |
so the use of atomic grouping for matching strings of non-parentheses |
repeats, and so the use of atomic grouping for matching strings of non- |
| 4155 |
is important when applying the pattern to strings that do not match. |
parentheses is important when applying the pattern to strings that do |
| 4156 |
For example, when this pattern is applied to |
not match. For example, when this pattern is applied to |
| 4157 |
|
|
| 4158 |
(aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa() |
(aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa() |
| 4159 |
|
|
| 4160 |
it yields "no match" quickly. However, if atomic grouping is not used, |
it yields "no match" quickly. However, if atomic grouping is not used, |
| 4161 |
the match runs for a very long time indeed because there are so many |
the match runs for a very long time indeed because there are so many |
| 4162 |
different ways the + and * repeats can carve up the subject, and all |
different ways the + and * repeats can carve up the subject, and all |
| 4163 |
have to be tested before failure can be reported. |
have to be tested before failure can be reported. |
| 4164 |
|
|
| 4165 |
At the end of a match, the values set for any capturing subpatterns are |
At the end of a match, the values set for any capturing subpatterns are |
| 4166 |
those from the outermost level of the recursion at which the subpattern |
those from the outermost level of the recursion at which the subpattern |
| 4167 |
value is set. If you want to obtain intermediate values, a callout |
value is set. If you want to obtain intermediate values, a callout |
| 4168 |
function can be used (see the next section and the pcrecallout documen- |
function can be used (see below and the pcrecallout documentation). If |
| 4169 |
tation). If the pattern above is matched against |
the pattern above is matched against |
| 4170 |
|
|
| 4171 |
(ab(cd)ef) |
(ab(cd)ef) |
| 4172 |
|
|
| 4173 |
the value for the capturing parentheses is "ef", which is the last |
the value for the capturing parentheses is "ef", which is the last |
| 4174 |
value taken on at the top level. If additional parentheses are added, |
value taken on at the top level. If additional parentheses are added, |
| 4175 |
giving |
giving |
| 4176 |
|
|
| 4177 |
\( ( ( (?>[^()]+) | (?R) )* ) \) |
\( ( ( (?>[^()]+) | (?R) )* ) \) |
| 4178 |
   ^                        ^ |
   ^                        ^ |
| 4179 |
   ^                        ^ |
   ^                        ^ |
| 4180 |
|
|
| 4181 |
the string they capture is "ab(cd)ef", the contents of the top level |
the string they capture is "ab(cd)ef", the contents of the top level |
| 4182 |
parentheses. If there are more than 15 capturing parentheses in a pat- |
parentheses. If there are more than 15 capturing parentheses in a pat- |
| 4183 |
tern, PCRE has to obtain extra memory to store data during a recursion, |
tern, PCRE has to obtain extra memory to store data during a recursion, |
| 4184 |
which it does by using pcre_malloc, freeing it via pcre_free after- |
which it does by using pcre_malloc, freeing it via pcre_free after- |
| 4185 |
wards. If no memory can be obtained, the match fails with the |
wards. If no memory can be obtained, the match fails with the |
| 4186 |
PCRE_ERROR_NOMEMORY error. |
PCRE_ERROR_NOMEMORY error. |
| 4187 |
|
|
| 4188 |
Do not confuse the (?R) item with the condition (R), which tests for |
Do not confuse the (?R) item with the condition (R), which tests for |
| 4189 |
recursion. Consider this pattern, which matches text in angle brack- |
recursion. Consider this pattern, which matches text in angle brack- |
| 4190 |
ets, allowing for arbitrary nesting. Only digits are allowed in nested |
ets, allowing for arbitrary nesting. Only digits are allowed in nested |
| 4191 |
brackets (that is, when recursing), whereas any characters are permit- |
brackets (that is, when recursing), whereas any characters are permit- |
| 4192 |
ted at the outer level. |
ted at the outer level. |
| 4193 |
|
|
| 4194 |
< (?: (?(R) \d++ | [^<>]*+) | (?R)) * > |
< (?: (?(R) \d++ | [^<>]*+) | (?R)) * > |
| 4195 |
|
|
| 4196 |
In this pattern, (?(R) is the start of a conditional subpattern, with |
In this pattern, (?(R) is the start of a conditional subpattern, with |
| 4197 |
two different alternatives for the recursive and non-recursive cases. |
two different alternatives for the recursive and non-recursive cases. |
| 4198 |
The (?R) item is the actual recursive call. |
The (?R) item is the actual recursive call. |
| 4199 |
|
|
| 4200 |
|
|
| 4201 |
SUBPATTERNS AS SUBROUTINES |
SUBPATTERNS AS SUBROUTINES |
| 4202 |
|
|
| 4203 |
If the syntax for a recursive subpattern reference (either by number or |
If the syntax for a recursive subpattern reference (either by number or |
| 4204 |
by name) is used outside the parentheses to which it refers, it oper- |
by name) is used outside the parentheses to which it refers, it oper- |
| 4205 |
ates like a subroutine in a programming language. An earlier example |
ates like a subroutine in a programming language. The "called" subpat- |
| 4206 |
|
tern may be defined before or after the reference. An earlier example |
| 4207 |
pointed out that the pattern |
pointed out that the pattern |
| 4208 |
|
|
| 4209 |
(sens|respons)e and \1ibility |
(sens|respons)e and \1ibility |
| 4214 |
(sens|respons)e and (?1)ibility |
(sens|respons)e and (?1)ibility |
| 4215 |
|
|
| 4216 |
is used, it does match "sense and responsibility" as well as the other |
is used, it does match "sense and responsibility" as well as the other |
| 4217 |
two strings. Such references, if given numerically, must follow the |
two strings. Another example is given in the discussion of DEFINE |
| 4218 |
subpattern to which they refer. However, named references can refer to |
above. |
|
later subpatterns. |
|
| 4219 |
|
|
| 4220 |
Like recursive subpatterns, a "subroutine" call is always treated as an |
Like recursive subpatterns, a "subroutine" call is always treated as an |
| 4221 |
atomic group. That is, once it has matched some of the subject string, |
atomic group. That is, once it has matched some of the subject string, |
| 4222 |
it is never re-entered, even if it contains untried alternatives and |
it is never re-entered, even if it contains untried alternatives and |
| 4223 |
there is a subsequent matching failure. |
there is a subsequent matching failure. |
| 4224 |
|
|
| 4225 |
|
When a subpattern is used as a subroutine, processing options such as |
| 4226 |
|
case-independence are fixed when the subpattern is defined. They cannot |
| 4227 |
|
be changed for different calls. For example, consider this pattern: |
| 4228 |
|
|
| 4229 |
|
(abc)(?i:(?1)) |
| 4230 |
|
|
| 4231 |
|
It matches "abcabc". It does not match "abcABC" because the change of |
| 4232 |
|
processing option does not affect the called subpattern. |
| 4233 |
|
|
| 4234 |
|
|
| 4235 |
CALLOUTS |
CALLOUTS |
| 4236 |
|
|
| 4266 |
gether. A complete description of the interface to the callout function |
gether. A complete description of the interface to the callout function |
| 4267 |
is given in the pcrecallout documentation. |
is given in the pcrecallout documentation. |
| 4268 |
|
|
| 4269 |
Last updated: 06 June 2006 |
|
| 4270 |
|
SEE ALSO |
| 4271 |
|
|
| 4272 |
|
pcreapi(3), pcrecallout(3), pcrematching(3), pcre(3). |
| 4273 |
|
|
| 4274 |
|
Last updated: 06 December 2006 |
| 4275 |
Copyright (c) 1997-2006 University of Cambridge. |
Copyright (c) 1997-2006 University of Cambridge. |
| 4276 |
------------------------------------------------------------------------------ |
------------------------------------------------------------------------------ |
| 4277 |
|
|
| 4377 |
The first data string is matched completely, so pcretest shows the |
The first data string is matched completely, so pcretest shows the |
| 4378 |
matched substrings. The remaining four strings do not match the com- |
matched substrings. The remaining four strings do not match the com- |
| 4379 |
plete pattern, but the first two are partial matches. The same test, |
plete pattern, but the first two are partial matches. The same test, |
| 4380 |
using DFA matching (by means of the \D escape sequence), produces the |
using pcre_dfa_exec() matching (by means of the \D escape sequence), |
| 4381 |
following output: |
produces the following output: |
| 4382 |
|
|
| 4383 |
re> /^?(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)$/ |
re> /^?(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)$/ |
| 4384 |
data> 25jun04\P\D |
data> 25jun04\P\D |
| 4400 |
|
|
| 4401 |
When a partial match has been found using pcre_dfa_exec(), it is possi- |
When a partial match has been found using pcre_dfa_exec(), it is possi- |
| 4402 |
ble to continue the match by providing additional subject data and |
ble to continue the match by providing additional subject data and |
| 4403 |
calling pcre_dfa_exec() again with the PCRE_DFA_RESTART option and the |
calling pcre_dfa_exec() again with the same compiled regular expres- |
| 4404 |
same working space (where details of the previous partial match are |
sion, this time setting the PCRE_DFA_RESTART option. You must also pass |
| 4405 |
stored). Here is an example using pcretest, where the \R escape |
the same working space as before, because this is where details of the |
| 4406 |
sequence sets the PCRE_DFA_RESTART option and the \D escape sequence |
previous partial match are stored. Here is an example using pcretest, |
| 4407 |
requests the use of pcre_dfa_exec(): |
using the \R escape sequence to set the PCRE_DFA_RESTART option (\P and |
| 4408 |
|
\D are as above): |
| 4409 |
|
|
| 4410 |
re> /^?(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)$/ |
re> /^?(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)$/ |
| 4411 |
data> 23ja\P\D |
data> 23ja\P\D |
| 4413 |
data> n05\R\D |
data> n05\R\D |
| 4414 |
0: n05 |
0: n05 |
| 4415 |
|
|
| 4416 |
The first call has "23ja" as the subject, and requests partial match- |
The first call has "23ja" as the subject, and requests partial match- |
| 4417 |
ing; the second call has "n05" as the subject for the continued |
ing; the second call has "n05" as the subject for the continued |
| 4418 |
(restarted) match. Notice that when the match is complete, only the |
(restarted) match. Notice that when the match is complete, only the |
| 4419 |
last part is shown; PCRE does not retain the previously partially- |
last part is shown; PCRE does not retain the previously partially- |
| 4420 |
matched string. It is up to the calling program to do that if it needs |
matched string. It is up to the calling program to do that if it needs |
| 4421 |
to. |
to. |
| 4422 |
|
|
| 4423 |
This facility can be used to pass very long subject strings to |
You can set PCRE_PARTIAL with PCRE_DFA_RESTART to continue partial |
| 4424 |
pcre_dfa_exec(). However, some care is needed for certain types of pat- |
matching over multiple segments. This facility can be used to pass very |
| 4425 |
tern. |
long subject strings to pcre_dfa_exec(). However, some care is needed |
| 4426 |
|
for certain types of pattern. |
| 4427 |
|
|
| 4428 |
1. If the pattern contains tests for the beginning or end of a line, |
1. If the pattern contains tests for the beginning or end of a line, |
| 4429 |
you need to pass the PCRE_NOTBOL or PCRE_NOTEOL options, as appropri- |
you need to pass the PCRE_NOTBOL or PCRE_NOTEOL options, as appropri- |
| 4432 |
|
|
| 4433 |
2. If the pattern contains backward assertions (including \b or \B), |
2. If the pattern contains backward assertions (including \b or \B), |
| 4434 |
you need to arrange for some overlap in the subject strings to allow |
you need to arrange for some overlap in the subject strings to allow |
| 4435 |
for this. For example, you could pass the subject in chunks that were |
for this. For example, you could pass the subject in chunks that are |
| 4436 |
500 bytes long, but in a buffer of 700 bytes, with the starting offset |
500 bytes long, but in a buffer of 700 bytes, with the starting offset |
| 4437 |
set to 200 and the previous 200 bytes at the start of the buffer. |
set to 200 and the previous 200 bytes at the start of the buffer. |
| 4438 |
|
|
| 4482 |
|
|
| 4483 |
where no string can be a partial match for both alternatives. |
where no string can be a partial match for both alternatives. |
| 4484 |
|
|
| 4485 |
Last updated: 16 January 2006 |
Last updated: 30 November 2006 |
| 4486 |
Copyright (c) 1997-2006 University of Cambridge. |
Copyright (c) 1997-2006 University of Cambridge. |
| 4487 |
------------------------------------------------------------------------------ |
------------------------------------------------------------------------------ |
| 4488 |
|
|
| 4594 |
makes up a compiled pattern was changed for release 5.0. If you have |
makes up a compiled pattern was changed for release 5.0. If you have |
| 4595 |
any saved patterns that were compiled with previous releases (not a |
any saved patterns that were compiled with previous releases (not a |
| 4596 |
facility that was previously advertised), you will have to recompile |
facility that was previously advertised), you will have to recompile |
| 4597 |
them for release 5.0. However, from now on, it should be possible to |
them for release 5.0 and above. |
|
make changes in a compatible manner. |
|
| 4598 |
|
|
| 4599 |
Notwithstanding the above, if you have any saved patterns in UTF-8 mode |
If you have any saved patterns in UTF-8 mode that use \p or \P that |
| 4600 |
that use \p or \P that were compiled with any release up to and includ- |
were compiled with any release up to and including 6.4, you will have |
| 4601 |
ing 6.4, you will have to recompile them for release 6.5 and above. |
to recompile them for release 6.5 and above. |
| 4602 |
|
|
| 4603 |
Last updated: 01 February 2006 |
All saved patterns from earlier releases must be recompiled for release |
| 4604 |
|
7.0 or higher, because there was an internal reorganization at that |
| 4605 |
|
release. |
| 4606 |
|
|
| 4607 |
|
Last updated: 28 November 2006 |
| 4608 |
Copyright (c) 1997-2006 University of Cambridge. |
Copyright (c) 1997-2006 University of Cambridge. |
| 4609 |
------------------------------------------------------------------------------ |
------------------------------------------------------------------------------ |
| 4610 |
|
|
| 4618 |
|
|
| 4619 |
PCRE PERFORMANCE |
PCRE PERFORMANCE |
| 4620 |
|
|
| 4621 |
Certain items that may appear in regular expression patterns are more |
Two aspects of performance are discussed below: memory usage and pro- |
| 4622 |
efficient than others. It is more efficient to use a character class |
cessing time. The way you express your pattern as a regular expression |
| 4623 |
like [aeiou] than a set of alternatives such as (a|e|i|o|u). In gen- |
can affect both of them. |
| 4624 |
eral, the simplest construction that provides the required behaviour is |
|
| 4625 |
usually the most efficient. Jeffrey Friedl's book contains a lot of |
|
| 4626 |
useful general discussion about optimizing regular expressions for |
MEMORY USAGE |
| 4627 |
efficient performance. This document contains a few observations about |
|
| 4628 |
PCRE. |
Patterns are compiled by PCRE into a reasonably efficient byte code, so |
| 4629 |
|
that most simple patterns do not use much memory. However, there is one |
| 4630 |
|
case where memory usage can be unexpectedly large. When a parenthesized |
| 4631 |
|
subpattern has a quantifier with a minimum greater than 1 and/or a lim- |
| 4632 |
|
ited maximum, the whole subpattern is repeated in the compiled code. |
| 4633 |
|
For example, the pattern |
| 4634 |
|
|
| 4635 |
|
(abc|def){2,4} |
| 4636 |
|
|
| 4637 |
|
is compiled as if it were |
| 4638 |
|
|
| 4639 |
|
(abc|def)(abc|def)((abc|def)(abc|def)?)? |
| 4640 |
|
|
| 4641 |
|
(Technical aside: It is done this way so that backtrack points within |
| 4642 |
|
each of the repetitions can be independently maintained.) |
| 4643 |
|
|
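The expansion described above can be reproduced textually. The following is a sketch only: expand_repeat is a hypothetical helper that builds the "as if" pattern string for a group with a {min,max} quantifier. It does not show PCRE's internal byte code, and it assumes that the group is passed with its own parentheses.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Append src to the NUL-terminated string in dst if it fits in cap
 * bytes (including the terminator). Returns 0 on success, -1 if not. */
static int append(char *dst, size_t cap, const char *src)
{
    size_t used = strlen(dst), need = strlen(src);
    if (used + need + 1 > cap)
        return -1;
    memcpy(dst + used, src, need + 1);
    return 0;
}

/* Write the unrolled form of group{min,max} into out.
 * Returns 0 on success, -1 on bad arguments or lack of space. */
int expand_repeat(const char *group, int min, int max, char *out, size_t cap)
{
    int i, extra = max - min;

    assert(group != NULL && out != NULL);
    if (min < 0 || max < min || cap == 0)
        return -1;
    out[0] = '\0';

    for (i = 0; i < min; i++)          /* the compulsory copies */
        if (append(out, cap, group))
            return -1;
    for (i = 1; i < extra; i++)        /* open the nested optional groups */
        if (append(out, cap, "(") || append(out, cap, group))
            return -1;
    if (extra > 0)                     /* the innermost optional copy */
        if (append(out, cap, group) || append(out, cap, "?"))
            return -1;
    for (i = 1; i < extra; i++)        /* close the nested optional groups */
        if (append(out, cap, ")?"))
            return -1;
    return 0;
}
```

Calling expand_repeat("(abc|def)", 2, 4, buf, sizeof buf) leaves the string (abc|def)(abc|def)((abc|def)(abc|def)?)? in buf. The length of the result grows with the maximum, which is why large or nested limits make compiled patterns so big.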
| 4644 |
|
For regular expressions whose quantifiers use only small numbers, this |
| 4645 |
|
is not usually a problem. However, if the numbers are large, and par- |
| 4646 |
|
ticularly if such repetitions are nested, the memory usage can become |
| 4647 |
|
an embarrassment. For example, the very simple pattern |
| 4648 |
|
|
| 4649 |
|
((ab){1,1000}c){1,3} |
| 4650 |
|
|
| 4651 |
|
uses 51K bytes when compiled. When PCRE is compiled with its default |
| 4652 |
|
internal pointer size of two bytes, the size limit on a compiled pat- |
| 4653 |
|
tern is 64K, and this is reached with the above pattern if the outer |
| 4654 |
|
repetition is increased from 3 to 4. PCRE can be compiled to use larger |
| 4655 |
|
internal pointers and thus handle larger compiled patterns, but it is |
| 4656 |
|
better to try to rewrite your pattern to use less memory if you can. |
| 4657 |
|
|
| 4658 |
|
One way of reducing the memory usage for such patterns is to make use |
| 4659 |
|
of PCRE's "subroutine" facility. Re-writing the above pattern as |
| 4660 |
|
|
| 4661 |
|
((ab)(?2){0,999}c)(?1){0,2} |
| 4662 |
|
|
| 4663 |
|
reduces the memory requirements to 18K, and indeed it remains under 20K |
| 4664 |
|
even with the outer repetition increased to 100. However, this pattern |
| 4665 |
|
is not exactly equivalent, because the "subroutine" calls are treated |
| 4666 |
|
as atomic groups into which there can be no backtracking if there is a |
| 4667 |
|
subsequent matching failure. Therefore, PCRE cannot do this kind of |
| 4668 |
|
rewriting automatically. Furthermore, there is a noticeable loss of |
| 4669 |
|
speed when executing the modified pattern. Nevertheless, if the atomic |
| 4670 |
|
grouping is not a problem and the loss of speed is acceptable, this |
| 4671 |
|
kind of rewriting will allow you to process patterns that PCRE cannot |
| 4672 |
|
otherwise handle. |
| 4673 |
|
|
| 4674 |
|
|
| 4675 |
|
PROCESSING TIME |
| 4676 |
|
|
| 4677 |
|
Certain items in regular expression patterns are processed more effi- |
| 4678 |
|
ciently than others. It is more efficient to use a character class like |
| 4679 |
|
[aeiou] than a set of single-character alternatives such as |
| 4680 |
|
(a|e|i|o|u). In general, the simplest construction that provides the |
| 4681 |
|
required behaviour is usually the most efficient. Jeffrey Friedl's book |
| 4682 |
|
contains a lot of useful general discussion about optimizing regular |
| 4683 |
|
expressions for efficient performance. This document contains a few |
| 4684 |
|
observations about PCRE. |
| 4685 |
|
|
| 4686 |
Using Unicode character properties (the \p, \P, and \X escapes) is |
Using Unicode character properties (the \p, \P, and \X escapes) is |
| 4687 |
slow, because PCRE has to scan a structure that contains data for over |
slow, because PCRE has to scan a structure that contains data for over |
| 4715 |
take a long time to run when applied to a string that does not match. |
take a long time to run when applied to a string that does not match. |
| 4716 |
Consider the pattern fragment |
Consider the pattern fragment |
| 4717 |
|
|
| 4718 |
(a+)* |
^(a+)* |
| 4719 |
|
|
| 4720 |
This can match "aaaa" in 33 different ways, and this number increases |
This can match "aaaa" in 16 different ways, and this number increases |
| 4721 |
very rapidly as the string gets longer. (The * repeat can match 0, 1, |
very rapidly as the string gets longer. (The * repeat can match 0, 1, |
| 4722 |
2, 3, or 4 times, and for each of those cases other than 0, the + |
2, 3, or 4 times, and for each of those cases other than 0 or 4, the + |
| 4723 |
repeats can match different numbers of times.) When the remainder of |
repeats can match different numbers of times.) When the remainder of |
| 4724 |
the pattern is such that the entire match is going to fail, PCRE has in |
the pattern is such that the entire match is going to fail, PCRE has in |
| 4725 |
principle to try every possible variation, and this can take an |
principle to try every possible variation, and this can take an |
| 4726 |
extremely long time. |
extremely long time, even for relatively short strings. |
| 4727 |
|
|
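The figure of 16 for the anchored pattern can be checked by enumeration. The sketch below does not use PCRE. It counts the ways in which the * and + repeats can divide up a prefix of a run of "a" characters; splits and count_ways are hypothetical names chosen for this illustration.

```c
#include <assert.h>

/* Number of ways (a+)* can consume exactly n characters: the first +
 * iteration takes 1..n of them, and the rest are divided in the same
 * way. Consuming nothing (n == 0) can be done in exactly one way. */
static unsigned long splits(int n)
{
    unsigned long total = 0;
    int first;

    assert(n >= 0);
    if (n == 0)
        return 1;
    for (first = 1; first <= n; first++)
        total += splits(n - first);
    return total;
}

/* Number of ways ^(a+)* can match at the start of a run of len "a"
 * characters: the match may stop after any prefix of length 0..len. */
unsigned long count_ways(int len)
{
    unsigned long total = 0;
    int prefix;

    for (prefix = 0; prefix <= len; prefix++)
        total += splits(prefix);
    return total;
}
```

count_ways(4) returns 16 (1 + 1 + 2 + 4 + 8), and each additional character doubles the result, so count_ways(10) is already 1024. This doubling is what makes a failing match so expensive when every variation has to be tried.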
| 4728 |
An optimization catches some of the more simple cases such as |
An optimization catches some of the more simple cases such as |
| 4729 |
|
|
| 4744 |
In many cases, the solution to this kind of performance issue is to use |
In many cases, the solution to this kind of performance issue is to use |
| 4745 |
an atomic group or a possessive quantifier. |
an atomic group or a possessive quantifier. |
| 4746 |
|
|
| 4747 |
Last updated: 28 February 2005 |
Last updated: 20 September 2006 |
| 4748 |
Copyright (c) 1997-2005 University of Cambridge. |
Copyright (c) 1997-2006 University of Cambridge. |
| 4749 |
------------------------------------------------------------------------------ |
------------------------------------------------------------------------------ |
| 4750 |
|
|
| 4751 |
|
|
| 4959 |
|
|
| 4960 |
Philip Hazel |
Philip Hazel |
| 4961 |
University Computing Service, |
University Computing Service, |
| 4962 |
Cambridge CB2 3QG, England. |
Cambridge CB2 3QH, England. |
| 4963 |
|
|
| 4964 |
Last updated: 16 January 2006 |
Last updated: 16 January 2006 |
| 4965 |
Copyright (c) 1997-2006 University of Cambridge. |
Copyright (c) 1997-2006 University of Cambridge. |
| 5054 |
number of sub-patterns, "i"th captured sub-pattern is |
number of sub-patterns, "i"th captured sub-pattern is |
| 5055 |
ignored. |
ignored. |
| 5056 |
|
|
| 5057 |
|
CAVEAT: An optional sub-pattern that does not exist in the matched |
| 5058 |
|
string is assigned the empty string. Therefore, the following will |
| 5059 |
|
return false (because the empty string is not a valid number): |
| 5060 |
|
|
| 5061 |
|
int number; |
| 5062 |
|
pcrecpp::RE("[a-z]+(\\d+)?").FullMatch("abc", &number); |
| 5063 |
|
|
| 5064 |
The matching interface supports at most 16 arguments per call. If you |
The matching interface supports at most 16 arguments per call. If you |
| 5065 |
need more, consider using the more general interface |
need more, consider using the more general interface |
| 5066 |
pcrecpp::RE::DoMatch. See pcrecpp.h for the signature for DoMatch. |
pcrecpp::RE::DoMatch. See pcrecpp.h for the signature for DoMatch. |
| 5067 |
|
|
| 5068 |
|
|
| 5069 |
|
QUOTING METACHARACTERS |
| 5070 |
|
|
| 5071 |
|
You can use the "QuoteMeta" operation to insert backslashes before all |
| 5072 |
|
potentially meaningful characters in a string. The returned string, |
| 5073 |
|
used as a regular expression, will exactly match the original string. |
| 5074 |
|
|
| 5075 |
|
Example: |
| 5076 |
|
string quoted = RE::QuoteMeta(unquoted); |
| 5077 |
|
|
| 5078 |
|
Note that it's legal to escape a character even if it has no special |
| 5079 |
|
meaning in a regular expression -- so this function does that. (This |
| 5080 |
|
also makes it identical to the perl function of the same name; see |
| 5081 |
|
"perldoc -f quotemeta".) For example, "1.5-2.0?" becomes |
| 5082 |
|
"1\.5\-2\.0\?". |
| 5083 |
|
|
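The quoting rule can be sketched in plain C. This is an illustration under stated assumptions, not the pcrecpp implementation: quote_meta is a hypothetical helper, the input is assumed to be ASCII text in the default "C" locale, and the treatment of bytes above 127 is not addressed.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Copy src to dst, putting a backslash before every byte that is not
 * a letter, a digit, or an underscore. The caller must provide room
 * for 2 * strlen(src) + 1 bytes in dst. Returns dst. */
char *quote_meta(const char *src, char *dst)
{
    char *out = dst;

    assert(src != NULL && dst != NULL);
    for (; *src != '\0'; src++) {
        unsigned char c = (unsigned char)*src;
        if (!isalnum(c) && c != '_')
            *out++ = '\\';         /* escape anything that might be special */
        *out++ = (char)c;
    }
    *out = '\0';
    return dst;
}
```

Given "1.5-2.0?" this produces the twelve characters 1\.5\-2\.0\? as in the example above. The hyphen is escaped even though it has no special meaning outside a character class, which is harmless.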
| 5084 |
|
|
| 5085 |
PARTIAL MATCHES |
PARTIAL MATCHES |
| 5086 |
|
|
| 5087 |
You can use the "PartialMatch" operation when you want the pattern to |
You can use the "PartialMatch" operation when you want the pattern to |
| 5293 |
AUTHOR |
AUTHOR |
| 5294 |
|
|
| 5295 |
The C++ wrapper was contributed by Google Inc. |
The C++ wrapper was contributed by Google Inc. |
| 5296 |
Copyright (c) 2005 Google Inc. |
Copyright (c) 2006 Google Inc. |
| 5297 |
------------------------------------------------------------------------------ |
------------------------------------------------------------------------------ |
| 5298 |
|
|
| 5299 |
|
|
| 5423 |
quantifier is used to stop any backtracking into the runs of non-"<" |
quantifier is used to stop any backtracking into the runs of non-"<" |
| 5424 |
characters, but that is not related to stack usage. |
characters, but that is not related to stack usage. |
| 5425 |
|
|
| 5426 |
|
This example shows that one way of avoiding stack problems when match- |
| 5427 |
|
ing long subject strings is to write repeated parenthesized subpatterns |
| 5428 |
|
to match more than one character whenever possible. |
| 5429 |
|
|
| 5430 |
In environments where stack memory is constrained, you might want to |
In environments where stack memory is constrained, you might want to |
| 5431 |
compile PCRE to use heap memory instead of stack for remembering back- |
compile PCRE to use heap memory instead of stack for remembering back- |
| 5432 |
up points. This makes it run a lot more slowly, however. Details of how |
up points. This makes it run a lot more slowly, however. Details of how |
| 5433 |
to do this are given in the pcrebuild documentation. |
to do this are given in the pcrebuild documentation. |
| 5434 |
|
|
| 5435 |
In Unix-like environments, there is not often a problem with the stack, |
In Unix-like environments, there is not often a problem with the stack |
| 5436 |
though the default limit on stack size varies from system to system. |
unless very long strings are involved, though the default limit on |
| 5437 |
Values from 8Mb to 64Mb are common. You can find your default limit by |
stack size varies from system to system. Values from 8Mb to 64Mb are |
| 5438 |
running the command: |
common. You can find your default limit by running the command: |
| 5439 |
|
|
| 5440 |
ulimit -s |
ulimit -s |
| 5441 |
|
|
| 5442 |
The effect of running out of stack is often SIGSEGV, though sometimes |
Unfortunately, the effect of running out of stack is often SIGSEGV, |
| 5443 |
an error message is given. You can normally increase the limit on stack |
though sometimes a more explicit error message is given. You can nor- |
| 5444 |
size by code such as this: |
mally increase the limit on stack size by code such as this: |
| 5445 |
|
|
| 5446 |
struct rlimit rlim; |
struct rlimit rlim; |
| 5447 |
getrlimit(RLIMIT_STACK, &rlim); |
getrlimit(RLIMIT_STACK, &rlim); |
| 5463 |
recursion. Thus, if you want to limit your stack usage to 8Mb, you |
recursion. Thus, if you want to limit your stack usage to 8Mb, you |
| 5464 |
should set the limit at 16000 recursions. A 64Mb stack, on the other |
should set the limit at 16000 recursions. A 64Mb stack, on the other |
| 5465 |
hand, can support around 128000 recursions. The pcretest test program |
hand, can support around 128000 recursions. The pcretest test program |
| 5466 |
has a command line option (-S) that can be used to increase its stack. |
has a command line option (-S) that can be used to increase the size of |
| 5467 |
|
its stack. |
| 5468 |
|
|
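The two figures above (8Mb for 16000 recursions, 64Mb for 128000) both work out at about 500 bytes of stack per recursion, counting 1Mb as 1000000 bytes. The sketch below turns that arithmetic into a function. The name recursion_limit_for is hypothetical, and the per-recursion cost is passed in because it is only a rough estimate that varies between systems and builds.

```c
#include <assert.h>

/* Largest recursion limit that keeps stack use within stack_bytes,
 * given an estimate of the stack consumed by each recursion. */
unsigned long recursion_limit_for(unsigned long stack_bytes,
                                  unsigned long bytes_per_recursion)
{
    assert(bytes_per_recursion > 0);   /* avoid division by zero */
    return stack_bytes / bytes_per_recursion;
}
```

With an estimate of 500 bytes, recursion_limit_for(8000000UL, 500UL) gives 16000. A value obtained in this way could then be placed in the match_limit_recursion field of a pcre_extra block, with the PCRE_EXTRA_MATCH_LIMIT_RECURSION flag set, before the matching function is called.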
| 5469 |
Last updated: 29 June 2006 |
Last updated: 14 September 2006 |
| 5470 |
Copyright (c) 1997-2006 University of Cambridge. |
Copyright (c) 1997-2006 University of Cambridge. |
| 5471 |
------------------------------------------------------------------------------ |
------------------------------------------------------------------------------ |
| 5472 |
|
|