| 114 |
There is no limit to the number of parenthesized subpatterns, but there |
There is no limit to the number of parenthesized subpatterns, but there |
| 115 |
can be no more than 65535 capturing subpatterns. |
can be no more than 65535 capturing subpatterns. |
| 116 |
|
|
| 117 |
|
If a non-capturing subpattern with an unlimited repetition quantifier |
| 118 |
|
can match an empty string, there is a limit of 1000 on the number of |
| 119 |
|
times it can be repeated while not matching an empty string - if it |
| 120 |
|
does match an empty string, the loop is immediately broken. |
| 121 |
|
|
| 122 |
The maximum length of name for a named subpattern is 32 characters, and |
The maximum length of name for a named subpattern is 32 characters, and |
| 123 |
the maximum number of named subpatterns is 10000. |
the maximum number of named subpatterns is 10000. |
| 124 |
|
|
| 125 |
The maximum length of a subject string is the largest positive number |
The maximum length of a subject string is the largest positive number |
| 126 |
that an integer variable can hold. However, when using the traditional |
that an integer variable can hold. However, when using the traditional |
| 127 |
matching function, PCRE uses recursion to handle subpatterns and indef- |
matching function, PCRE uses recursion to handle subpatterns and indef- |
| 128 |
inite repetition. This means that the available stack space may limit |
inite repetition. This means that the available stack space may limit |
| 129 |
the size of a subject string that can be processed by certain patterns. |
the size of a subject string that can be processed by certain patterns. |
| 130 |
For a discussion of stack issues, see the pcrestack documentation. |
For a discussion of stack issues, see the pcrestack documentation. |
| 131 |
|
|
| 132 |
|
|
| 133 |
UTF-8 AND UNICODE PROPERTY SUPPORT |
UTF-8 AND UNICODE PROPERTY SUPPORT |
| 134 |
|
|
| 135 |
From release 3.3, PCRE has had some support for character strings |
From release 3.3, PCRE has had some support for character strings |
| 136 |
encoded in the UTF-8 format. For release 4.0 this was greatly extended |
encoded in the UTF-8 format. For release 4.0 this was greatly extended |
| 137 |
to cover most common requirements, and in release 5.0 additional sup- |
to cover most common requirements, and in release 5.0 additional sup- |
| 138 |
port for Unicode general category properties was added. |
port for Unicode general category properties was added. |
| 139 |
|
|
| 140 |
In order to process UTF-8 strings, you must build PCRE to include UTF-8 |
In order to process UTF-8 strings, you must build PCRE to include UTF-8 |
| 141 |
support in the code, and, in addition, you must call pcre_compile() |
support in the code, and, in addition, you must call pcre_compile() |
| 142 |
with the PCRE_UTF8 option flag. When you do this, both the pattern and |
with the PCRE_UTF8 option flag. When you do this, both the pattern and |
| 143 |
any subject strings that are matched against it are treated as UTF-8 |
any subject strings that are matched against it are treated as UTF-8 |
| 144 |
strings instead of just strings of bytes. |
strings instead of just strings of bytes. |
| 145 |
|
|
| 146 |
If you compile PCRE with UTF-8 support, but do not use it at run time, |
If you compile PCRE with UTF-8 support, but do not use it at run time, |
| 147 |
the library will be a bit bigger, but the additional run time overhead |
the library will be a bit bigger, but the additional run time overhead |
| 148 |
is limited to testing the PCRE_UTF8 flag occasionally, so should not be |
is limited to testing the PCRE_UTF8 flag occasionally, so should not be |
| 149 |
very big. |
very big. |
| 150 |
|
|
| 151 |
If PCRE is built with Unicode character property support (which implies |
If PCRE is built with Unicode character property support (which implies |
| 152 |
UTF-8 support), the escape sequences \p{..}, \P{..}, and \X are sup- |
UTF-8 support), the escape sequences \p{..}, \P{..}, and \X are sup- |
| 153 |
ported. The available properties that can be tested are limited to the |
ported. The available properties that can be tested are limited to the |
| 154 |
general category properties such as Lu for an upper case letter or Nd |
general category properties such as Lu for an upper case letter or Nd |
| 155 |
for a decimal number, the Unicode script names such as Arabic or Han, |
for a decimal number, the Unicode script names such as Arabic or Han, |
| 156 |
and the derived properties Any and L&. A full list is given in the |
and the derived properties Any and L&. A full list is given in the |
| 157 |
pcrepattern documentation. Only the short names for properties are sup- |
pcrepattern documentation. Only the short names for properties are sup- |
| 158 |
ported. For example, \p{L} matches a letter. Its Perl synonym, \p{Let- |
ported. For example, \p{L} matches a letter. Its Perl synonym, \p{Let- |
| 159 |
ter}, is not supported. Furthermore, in Perl, many properties may |
ter}, is not supported. Furthermore, in Perl, many properties may |
| 160 |
optionally be prefixed by "Is", for compatibility with Perl 5.6. PCRE |
optionally be prefixed by "Is", for compatibility with Perl 5.6. PCRE |
| 161 |
does not support this. |
does not support this. |
| 162 |
|
|
| 163 |
The following comments apply when PCRE is running in UTF-8 mode: |
The following comments apply when PCRE is running in UTF-8 mode: |
| 164 |
|
|
| 165 |
1. When you set the PCRE_UTF8 flag, the strings passed as patterns and |
1. When you set the PCRE_UTF8 flag, the strings passed as patterns and |
| 166 |
subjects are checked for validity on entry to the relevant functions. |
subjects are checked for validity on entry to the relevant functions. |
| 167 |
If an invalid UTF-8 string is passed, an error return is given. In some |
If an invalid UTF-8 string is passed, an error return is given. In some |
| 168 |
situations, you may already know that your strings are valid, and |
situations, you may already know that your strings are valid, and |
| 169 |
therefore want to skip these checks in order to improve performance. If |
therefore want to skip these checks in order to improve performance. If |
| 170 |
you set the PCRE_NO_UTF8_CHECK flag at compile time or at run time, |
you set the PCRE_NO_UTF8_CHECK flag at compile time or at run time, |
| 171 |
PCRE assumes that the pattern or subject it is given (respectively) |
PCRE assumes that the pattern or subject it is given (respectively) |
| 172 |
contains only valid UTF-8 codes. In this case, it does not diagnose an |
contains only valid UTF-8 codes. In this case, it does not diagnose an |
| 173 |
invalid UTF-8 string. If you pass an invalid UTF-8 string to PCRE when |
invalid UTF-8 string. If you pass an invalid UTF-8 string to PCRE when |
| 174 |
PCRE_NO_UTF8_CHECK is set, the results are undefined. Your program may |
PCRE_NO_UTF8_CHECK is set, the results are undefined. Your program may |
| 175 |
crash. |
crash. |
| 176 |
|
|
| 177 |
2. An unbraced hexadecimal escape sequence (such as \xb3) matches a |
2. An unbraced hexadecimal escape sequence (such as \xb3) matches a |
| 178 |
two-byte UTF-8 character if the value is greater than 127. |
two-byte UTF-8 character if the value is greater than 127. |
| 179 |
|
|
| 180 |
3. Octal numbers up to \777 are recognized, and match two-byte UTF-8 |
3. Octal numbers up to \777 are recognized, and match two-byte UTF-8 |
| 181 |
characters for values greater than \177. |
characters for values greater than \177. |
| 182 |
|
|
| 183 |
4. Repeat quantifiers apply to complete UTF-8 characters, not to indi- |
4. Repeat quantifiers apply to complete UTF-8 characters, not to indi- |
| 184 |
vidual bytes, for example: \x{100}{3}. |
vidual bytes, for example: \x{100}{3}. |
| 185 |
|
|
| 186 |
5. The dot metacharacter matches one UTF-8 character instead of a sin- |
5. The dot metacharacter matches one UTF-8 character instead of a sin- |
| 187 |
gle byte. |
gle byte. |
| 188 |
|
|
| 189 |
6. The escape sequence \C can be used to match a single byte in UTF-8 |
6. The escape sequence \C can be used to match a single byte in UTF-8 |
| 190 |
mode, but its use can lead to some strange effects. This facility is |
mode, but its use can lead to some strange effects. This facility is |
| 191 |
not available in the alternative matching function, pcre_dfa_exec(). |
not available in the alternative matching function, pcre_dfa_exec(). |
| 192 |
|
|
| 193 |
7. The character escapes \b, \B, \d, \D, \s, \S, \w, and \W correctly |
7. The character escapes \b, \B, \d, \D, \s, \S, \w, and \W correctly |
| 194 |
test characters of any code value, but the characters that PCRE recog- |
test characters of any code value, but the characters that PCRE recog- |
| 195 |
nizes as digits, spaces, or word characters remain the same set as |
nizes as digits, spaces, or word characters remain the same set as |
| 196 |
before, all with values less than 256. This remains true even when PCRE |
before, all with values less than 256. This remains true even when PCRE |
| 197 |
includes Unicode property support, because to do otherwise would slow |
includes Unicode property support, because to do otherwise would slow |
| 198 |
down PCRE in many common cases. If you really want to test for a wider |
down PCRE in many common cases. If you really want to test for a wider |
| 199 |
sense of, say, "digit", you must use Unicode property tests such as |
sense of, say, "digit", you must use Unicode property tests such as |
| 200 |
\p{Nd}. |
\p{Nd}. |
| 201 |
|
|
| 202 |
8. Similarly, characters that match the POSIX named character classes |
8. Similarly, characters that match the POSIX named character classes |
| 203 |
are all low-valued characters. |
are all low-valued characters. |
| 204 |
|
|
| 205 |
9. Case-insensitive matching applies only to characters whose values |
9. However, the Perl 5.10 horizontal and vertical whitespace matching |
| 206 |
are less than 128, unless PCRE is built with Unicode property support. |
escapes (\h, \H, \v, and \V) do match all the appropriate Unicode char- |
| 207 |
Even when Unicode property support is available, PCRE still uses its |
acters. |
| 208 |
own character tables when checking the case of low-valued characters, |
|
| 209 |
so as not to degrade performance. The Unicode property information is |
10. Case-insensitive matching applies only to characters whose values |
| 210 |
|
are less than 128, unless PCRE is built with Unicode property support. |
| 211 |
|
Even when Unicode property support is available, PCRE still uses its |
| 212 |
|
own character tables when checking the case of low-valued characters, |
| 213 |
|
so as not to degrade performance. The Unicode property information is |
| 214 |
used only for characters with higher values. Even when Unicode property |
used only for characters with higher values. Even when Unicode property |
| 215 |
support is available, PCRE supports case-insensitive matching only when |
support is available, PCRE supports case-insensitive matching only when |
| 216 |
there is a one-to-one mapping between a letter's cases. There are a |
there is a one-to-one mapping between a letter's cases. There are a |
| 217 |
small number of many-to-one mappings in Unicode; these are not sup- |
small number of many-to-one mappings in Unicode; these are not sup- |
| 218 |
ported by PCRE. |
ported by PCRE. |
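
Items 2 and 3 above say that escapes such as \xb3 or \777 denote two-byte UTF-8 characters once the value exceeds 127. The C sketch below shows the standard UTF-8 encoding for code points below 0x800, so that the bytes involved can be seen. The function name utf8_encode_small is hypothetical and is not part of the PCRE API; it only illustrates the encoding, not how PCRE handles it internally.

```c
#include <stddef.h>

/* Hypothetical helper, not part of PCRE: encode a code point below 0x800
   as UTF-8. Returns the number of bytes written (1 or 2), or 0 if the
   value is outside the range this sketch handles. */
size_t utf8_encode_small(unsigned int cp, unsigned char out[2])
{
    if (cp < 0x80) {                 /* 0..127: one byte, same as ASCII */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                /* 128..2047: two bytes */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));    /* 110xxxxx */
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));  /* 10xxxxxx */
        return 2;
    }
    return 0;                        /* longer forms omitted in this sketch */
}
```

Under this encoding, \xb3 corresponds to the bytes 0xC2 0xB3, and \777 (decimal 511) corresponds to 0xC7 0xBF.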
| 219 |
|
|
| 220 |
|
|
| 224 |
University Computing Service |
University Computing Service |
| 225 |
Cambridge CB2 3QH, England. |
Cambridge CB2 3QH, England. |
| 226 |
|
|
| 227 |
Putting an actual email address here seems to have been a spam magnet, |
Putting an actual email address here seems to have been a spam magnet, |
| 228 |
so I've taken it away. If you want to email me, use my two initials, |
so I've taken it away. If you want to email me, use my two initials, |
| 229 |
followed by the two digits 10, at the domain cam.ac.uk. |
followed by the two digits 10, at the domain cam.ac.uk. |
| 230 |
|
|
| 231 |
|
|
| 232 |
REVISION |
REVISION |
| 233 |
|
|
| 234 |
Last updated: 18 April 2007 |
Last updated: 30 July 2007 |
| 235 |
Copyright (c) 1997-2007 University of Cambridge. |
Copyright (c) 1997-2007 University of Cambridge. |
| 236 |
------------------------------------------------------------------------------ |
------------------------------------------------------------------------------ |
| 237 |
|
|
| 399 |
|
|
| 400 |
to the configure command. With this configuration, PCRE will use the |
to the configure command. With this configuration, PCRE will use the |
| 401 |
pcre_stack_malloc and pcre_stack_free variables to call memory manage- |
pcre_stack_malloc and pcre_stack_free variables to call memory manage- |
| 402 |
ment functions. Separate functions are provided because the usage is |
ment functions. By default these point to malloc() and free(), but you |
| 403 |
very predictable: the block sizes requested are always the same, and |
can replace the pointers so that your own functions are used. |
| 404 |
the blocks are always freed in reverse order. A calling program might |
|
| 405 |
be able to implement optimized functions that perform better than the |
Separate functions are provided rather than using pcre_malloc and |
| 406 |
standard malloc() and free() functions. PCRE runs noticeably more |
pcre_free because the usage is very predictable: the block sizes |
| 407 |
slowly when built in this way. This option affects only the pcre_exec() |
requested are always the same, and the blocks are always freed in |
| 408 |
function; it is not relevant for the pcre_dfa_exec() function. |
reverse order. A calling program might be able to implement optimized |
| 409 |
|
functions that perform better than malloc() and free(). PCRE runs |
| 410 |
|
noticeably more slowly when built in this way. This option affects only |
| 411 |
|
the pcre_exec() function; it is not relevant for the |
| 412 |
|
pcre_dfa_exec() function. |
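
Because the text above says that the requested blocks are always the same size, a replacement pair of functions can keep released blocks on a list and hand them out again instead of calling malloc() each time. The following C sketch illustrates that idea. The names my_stack_malloc, my_stack_free, free_node, free_list and block_size are hypothetical; the sketch assumes single-threaded use, refuses any request whose size differs from the first one it sees, and never returns cached blocks to the system.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical free-list cache for equal-sized blocks; the names below are
   illustrative and are not part of PCRE. */
typedef struct free_node { struct free_node *next; } free_node;

static free_node *free_list = NULL;  /* released blocks, most recent first */
static size_t block_size = 0;        /* size of the first request seen */

void *my_stack_malloc(size_t size)
{
    if (block_size == 0)
        block_size = size;           /* remember the one size in use */
    if (size != block_size)
        return NULL;                 /* this sketch handles one size only */
    if (free_list != NULL) {         /* reuse the most recently freed block */
        free_node *node = free_list;
        free_list = node->next;
        return node;
    }
    /* A cached block must be large enough to hold the list link. */
    return malloc(size < sizeof(free_node) ? sizeof(free_node) : size);
}

void my_stack_free(void *block)
{
    free_node *node = block;
    if (node == NULL)
        return;
    node->next = free_list;          /* keep the block for the next request */
    free_list = node;
}
```

In a program linked against a PCRE library built with the configuration described above, these would be installed by assigning them to the pcre_stack_malloc and pcre_stack_free variables before calling pcre_exec().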
| 413 |
|
|
| 414 |
|
|
| 415 |
LIMITING PCRE RESOURCE USAGE |
LIMITING PCRE RESOURCE USAGE |
| 464 |
|
|
| 465 |
PCRE assumes by default that it will run in an environment where the |
PCRE assumes by default that it will run in an environment where the |
| 466 |
character code is ASCII (or Unicode, which is a superset of ASCII). |
character code is ASCII (or Unicode, which is a superset of ASCII). |
| 467 |
PCRE can, however, be compiled to run in an EBCDIC environment by |
This is the case for most computer operating systems. PCRE can, how- |
| 468 |
adding |
ever, be compiled to run in an EBCDIC environment by adding |
| 469 |
|
|
| 470 |
--enable-ebcdic |
--enable-ebcdic |
| 471 |
|
|
| 472 |
to the configure command. This setting implies --enable-rebuild-charta- |
to the configure command. This setting implies --enable-rebuild-charta- |
| 473 |
bles. |
bles. You should only use it if you know that you are in an EBCDIC |
| 474 |
|
environment (for example, an IBM mainframe operating system). |
| 475 |
|
|
| 476 |
|
|
| 477 |
SEE ALSO |
SEE ALSO |
| 488 |
|
|
| 489 |
REVISION |
REVISION |
| 490 |
|
|
| 491 |
Last updated: 16 April 2007 |
Last updated: 30 July 2007 |
| 492 |
Copyright (c) 1997-2007 University of Cambridge. |
Copyright (c) 1997-2007 University of Cambridge. |
| 493 |
------------------------------------------------------------------------------ |
------------------------------------------------------------------------------ |
| 494 |
|
|
| 1273 |
26 malformed number or name after (?( |
26 malformed number or name after (?( |
| 1274 |
27 conditional group contains more than two branches |
27 conditional group contains more than two branches |
| 1275 |
28 assertion expected after (?( |
28 assertion expected after (?( |
| 1276 |
29 (?R or (?digits must be followed by ) |
29 (?R or (?[+-]digits must be followed by ) |
| 1277 |
30 unknown POSIX class name |
30 unknown POSIX class name |
| 1278 |
31 POSIX collating elements are not supported |
31 POSIX collating elements are not supported |
| 1279 |
32 this version of PCRE is not compiled with PCRE_UTF8 support |
32 this version of PCRE is not compiled with PCRE_UTF8 support |
| 1294 |
47 unknown property name after \P or \p |
47 unknown property name after \P or \p |
| 1295 |
48 subpattern name is too long (maximum 32 characters) |
48 subpattern name is too long (maximum 32 characters) |
| 1296 |
49 too many named subpatterns (maximum 10,000) |
49 too many named subpatterns (maximum 10,000) |
| 1297 |
50 repeated subpattern is too long |
50 [this code is not in use] |
| 1298 |
51 octal value is greater than \377 (not in UTF-8 mode) |
51 octal value is greater than \377 (not in UTF-8 mode) |
| 1299 |
52 internal error: overran compiling workspace |
52 internal error: overran compiling workspace |
| 1300 |
53 internal error: previously-checked referenced subpattern not |
53 internal error: previously-checked referenced subpattern not |
| 1302 |
54 DEFINE group contains more than one branch |
54 DEFINE group contains more than one branch |
| 1303 |
55 repeating a DEFINE group is not allowed |
55 repeating a DEFINE group is not allowed |
| 1304 |
56 inconsistent NEWLINE options |
56 inconsistent NEWLINE options |
| 1305 |
|
57 \g is not followed by a braced name or an optionally braced |
| 1306 |
|
non-zero number |
| 1307 |
|
58 (?+ or (?- or (?(+ or (?(- must be followed by a non-zero number |
| 1308 |
|
|
| 1309 |
|
|
| 1310 |
STUDYING A PATTERN |
STUDYING A PATTERN |
| 1497 |
|
|
| 1498 |
Return 1 if the (?J) option setting is used in the pattern, otherwise |
Return 1 if the (?J) option setting is used in the pattern, otherwise |
| 1499 |
0. The fourth argument should point to an int variable. The (?J) inter- |
0. The fourth argument should point to an int variable. The (?J) inter- |
| 1500 |
nal option setting changes the local PCRE_DUPNAMES value. |
nal option setting changes the local PCRE_DUPNAMES option. |
| 1501 |
|
|
| 1502 |
PCRE_INFO_LASTLITERAL |
PCRE_INFO_LASTLITERAL |
| 1503 |
|
|
| 1565 |
Return a copy of the options with which the pattern was compiled. The |
Return a copy of the options with which the pattern was compiled. The |
| 1566 |
fourth argument should point to an unsigned long int variable. These |
fourth argument should point to an unsigned long int variable. These |
| 1567 |
option bits are those specified in the call to pcre_compile(), modified |
option bits are those specified in the call to pcre_compile(), modified |
| 1568 |
by any top-level option settings within the pattern itself. |
by any top-level option settings at the start of the pattern itself. In |
| 1569 |
|
other words, they are the options that will be in force when matching |
| 1570 |
|
starts. For example, if the pattern /(?im)abc(?-i)d/ is compiled with |
| 1571 |
|
the PCRE_EXTENDED option, the result is PCRE_CASELESS, PCRE_MULTILINE, |
| 1572 |
|
and PCRE_EXTENDED. |
| 1573 |
|
|
| 1574 |
A pattern is automatically anchored by PCRE if all of its top-level |
A pattern is automatically anchored by PCRE if all of its top-level |
| 1575 |
alternatives begin with one of the following: |
alternatives begin with one of the following: |
| 2060 |
field in a pcre_extra structure (or defaulted) was reached. See the |
field in a pcre_extra structure (or defaulted) was reached. See the |
| 2061 |
description above. |
description above. |
| 2062 |
|
|
|
PCRE_ERROR_NULLWSLIMIT (-22) |
|
|
|
|
|
When a group that can match an empty substring is repeated with an |
|
|
unbounded upper limit, the subject position at the start of the group |
|
|
must be remembered, so that a test for an empty string can be made when |
|
|
the end of the group is reached. Some workspace is required for this; |
|
|
if it runs out, this error is given. |
|
|
|
|
| 2063 |
PCRE_ERROR_BADNEWLINE (-23) |
PCRE_ERROR_BADNEWLINE (-23) |
| 2064 |
|
|
| 2065 |
An invalid combination of PCRE_NEWLINE_xxx options was given. |
An invalid combination of PCRE_NEWLINE_xxx options was given. |
| 2066 |
|
|
| 2067 |
Error numbers -16 to -20 are not used by pcre_exec(). |
Error numbers -16 to -20 and -22 are not used by pcre_exec(). |
| 2068 |
|
|
| 2069 |
|
|
| 2070 |
EXTRACTING CAPTURED SUBSTRINGS BY NUMBER |
EXTRACTING CAPTURED SUBSTRINGS BY NUMBER |
| 2419 |
|
|
| 2420 |
REVISION |
REVISION |
| 2421 |
|
|
| 2422 |
Last updated: 04 June 2007 |
Last updated: 30 July 2007 |
| 2423 |
Copyright (c) 1997-2007 University of Cambridge. |
Copyright (c) 1997-2007 University of Cambridge. |
| 2424 |
------------------------------------------------------------------------------ |
------------------------------------------------------------------------------ |
| 2425 |
|
|
| 2606 |
|
|
| 2607 |
This document describes the differences in the ways that PCRE and Perl |
This document describes the differences in the ways that PCRE and Perl |
| 2608 |
handle regular expressions. The differences described here are mainly |
handle regular expressions. The differences described here are mainly |
| 2609 |
with respect to Perl 5.8, though PCRE version 7.0 contains some fea- |
with respect to Perl 5.8, though PCRE versions 7.0 and later contain |
| 2610 |
tures that are expected to be in the forthcoming Perl 5.10. |
some features that are expected to be in the forthcoming Perl 5.10. |
| 2611 |
|
|
| 2612 |
1. PCRE has only a subset of Perl's UTF-8 and Unicode support. Details |
1. PCRE has only a subset of Perl's UTF-8 and Unicode support. Details |
| 2613 |
of what it does have are given in the section on UTF-8 support in the |
of what it does have are given in the section on UTF-8 support in the |
| 2685 |
meta-character matches only at the very end of the string. |
meta-character matches only at the very end of the string. |
| 2686 |
|
|
| 2687 |
(c) If PCRE_EXTRA is set, a backslash followed by a letter with no spe- |
(c) If PCRE_EXTRA is set, a backslash followed by a letter with no spe- |
| 2688 |
cial meaning is faulted. Otherwise, like Perl, the backslash is |
cial meaning is faulted. Otherwise, like Perl, the backslash is quietly |
| 2689 |
ignored. (Perl can be made to issue a warning.) |
ignored. (Perl can be made to issue a warning.) |
| 2690 |
|
|
| 2691 |
(d) If PCRE_UNGREEDY is set, the greediness of the repetition quanti- |
(d) If PCRE_UNGREEDY is set, the greediness of the repetition quanti- |
| 2692 |
fiers is inverted, that is, by default they are not greedy, but if fol- |
fiers is inverted, that is, by default they are not greedy, but if fol- |
| 2718 |
|
|
| 2719 |
REVISION |
REVISION |
| 2720 |
|
|
| 2721 |
Last updated: 06 March 2007 |
Last updated: 13 June 2007 |
| 2722 |
Copyright (c) 1997-2007 University of Cambridge. |
Copyright (c) 1997-2007 University of Cambridge. |
| 2723 |
------------------------------------------------------------------------------ |
------------------------------------------------------------------------------ |
| 2724 |
|
|
| 2951 |
|
|
| 2952 |
\d any decimal digit |
\d any decimal digit |
| 2953 |
\D any character that is not a decimal digit |
\D any character that is not a decimal digit |
| 2954 |
|
\h any horizontal whitespace character |
| 2955 |
|
\H any character that is not a horizontal whitespace character |
| 2956 |
\s any whitespace character |
\s any whitespace character |
| 2957 |
\S any character that is not a whitespace character |
\S any character that is not a whitespace character |
| 2958 |
|
\v any vertical whitespace character |
| 2959 |
|
\V any character that is not a vertical whitespace character |
| 2960 |
\w any "word" character |
\w any "word" character |
| 2961 |
\W any "non-word" character |
\W any "non-word" character |
| 2962 |
|
|
| 2971 |
|
|
| 2972 |
For compatibility with Perl, \s does not match the VT character (code |
For compatibility with Perl, \s does not match the VT character (code |
| 2973 |
11). This makes it different from the POSIX "space" class. The \s |
11). This makes it different from the POSIX "space" class. The \s |
| 2974 |
characters are HT (9), LF (10), FF (12), CR (13), and space (32). (If |
characters are HT (9), LF (10), FF (12), CR (13), and space (32). If |
| 2975 |
"use locale;" is included in a Perl script, \s may match the VT charac- |
"use locale;" is included in a Perl script, \s may match the VT charac- |
| 2976 |
ter. In PCRE, it never does.) |
ter. In PCRE, it never does. |
| 2977 |
|
|
| 2978 |
|
In UTF-8 mode, characters with values greater than 128 never match \d, |
| 2979 |
|
\s, or \w, and always match \D, \S, and \W. This is true even when Uni- |
| 2980 |
|
code character property support is available. These sequences retain |
| 2981 |
|
their original meanings from before UTF-8 support was available, mainly |
| 2982 |
|
for efficiency reasons. |
| 2983 |
|
|
| 2984 |
|
The sequences \h, \H, \v, and \V are Perl 5.10 features. In contrast to |
| 2985 |
|
the other sequences, these do match certain high-valued codepoints in |
| 2986 |
|
UTF-8 mode. The horizontal space characters are: |
| 2987 |
|
|
| 2988 |
|
U+0009 Horizontal tab |
| 2989 |
|
U+0020 Space |
| 2990 |
|
U+00A0 Non-break space |
| 2991 |
|
U+1680 Ogham space mark |
| 2992 |
|
U+180E Mongolian vowel separator |
| 2993 |
|
U+2000 En quad |
| 2994 |
|
U+2001 Em quad |
| 2995 |
|
U+2002 En space |
| 2996 |
|
U+2003 Em space |
| 2997 |
|
U+2004 Three-per-em space |
| 2998 |
|
U+2005 Four-per-em space |
| 2999 |
|
U+2006 Six-per-em space |
| 3000 |
|
U+2007 Figure space |
| 3001 |
|
U+2008 Punctuation space |
| 3002 |
|
U+2009 Thin space |
| 3003 |
|
U+200A Hair space |
| 3004 |
|
U+202F Narrow no-break space |
| 3005 |
|
U+205F Medium mathematical space |
| 3006 |
|
U+3000 Ideographic space |
| 3007 |
|
|
| 3008 |
|
The vertical space characters are: |
| 3009 |
|
|
| 3010 |
|
U+000A Linefeed |
| 3011 |
|
U+000B Vertical tab |
| 3012 |
|
U+000C Formfeed |
| 3013 |
|
U+000D Carriage return |
| 3014 |
|
U+0085 Next line |
| 3015 |
|
U+2028 Line separator |
| 3016 |
|
U+2029 Paragraph separator |
| 3017 |
|
|
| 3018 |
A "word" character is an underscore or any character less than 256 that |
A "word" character is an underscore or any character less than 256 that |
| 3019 |
is a letter or digit. The definition of letters and digits is con- |
is a letter or digit. The definition of letters and digits is con- |
| 3021 |
specific matching is taking place (see "Locale support" in the pcreapi |
specific matching is taking place (see "Locale support" in the pcreapi |
| 3022 |
page). For example, in a French locale such as "fr_FR" in Unix-like |
page). For example, in a French locale such as "fr_FR" in Unix-like |
| 3023 |
systems, or "french" in Windows, some character codes greater than 128 |
systems, or "french" in Windows, some character codes greater than 128 |
| 3024 |
are used for accented letters, and these are matched by \w. |
are used for accented letters, and these are matched by \w. The use of |
| 3025 |
|
locales with Unicode is discouraged. |
|
In UTF-8 mode, characters with values greater than 128 never match \d, |
|
|
\s, or \w, and always match \D, \S, and \W. This is true even when Uni- |
|
|
code character property support is available. The use of locales with |
|
|
Unicode is discouraged. |
|
| 3026 |
|
|
| 3027 |
Newline sequences |
Newline sequences |
| 3028 |
|
|
| 3029 |
Outside a character class, the escape sequence \R matches any Unicode |
Outside a character class, the escape sequence \R matches any Unicode |
| 3030 |
newline sequence. This is an extension to Perl. In non-UTF-8 mode \R is |
newline sequence. This is a Perl 5.10 feature. In non-UTF-8 mode \R is |
| 3031 |
equivalent to the following: |
equivalent to the following: |
| 3032 |
|
|
| 3033 |
(?>\r\n|\n|\x0b|\f|\r|\x85) |
(?>\r\n|\n|\x0b|\f|\r|\x85) |
| 3049 |
Unicode character properties |
Unicode character properties |
| 3050 |
|
|
| 3051 |
When PCRE is built with Unicode character property support, three addi- |
When PCRE is built with Unicode character property support, three addi- |
| 3052 |
tional escape sequences to match character properties are available |
tional escape sequences that match characters with specific properties |
| 3053 |
when UTF-8 mode is selected. They are: |
are available. When not in UTF-8 mode, these sequences are of course |
| 3054 |
|
limited to testing characters whose codepoints are less than 256, but |
| 3055 |
|
they do work in this mode. The extra escape sequences are: |
| 3056 |
|
|
| 3057 |
\p{xx} a character with the xx property |
\p{xx} a character with the xx property |
| 3058 |
\P{xx} a character without the xx property |
\P{xx} a character without the xx property |
| 3166 |
That is, it matches a character without the "mark" property, followed |
That is, it matches a character without the "mark" property, followed |
| 3167 |
by zero or more characters with the "mark" property, and treats the |
by zero or more characters with the "mark" property, and treats the |
| 3168 |
sequence as an atomic group (see below). Characters with the "mark" |
sequence as an atomic group (see below). Characters with the "mark" |
| 3169 |
property are typically accents that affect the preceding character. |
property are typically accents that affect the preceding character. |
| 3170 |
|
None of them have codepoints less than 256, so in non-UTF-8 mode \X |
| 3171 |
|
matches any one character. |
| 3172 |
|
|
| 3173 |
Matching characters by Unicode property is not fast, because PCRE has |
Matching characters by Unicode property is not fast, because PCRE has |
| 3174 |
to search a structure that contains data for over fifteen thousand |
to search a structure that contains data for over fifteen thousand |
| 3594 |
"Saturday". |
"Saturday". |
| 3595 |
|
|
| 3596 |
|
|
| 3597 |
|
DUPLICATE SUBPATTERN NUMBERS |
| 3598 |
|
|
| 3599 |
|
Perl 5.10 introduced a feature whereby each alternative in a subpattern |
| 3600 |
|
uses the same numbers for its capturing parentheses. Such a subpattern |
| 3601 |
|
starts with (?| and is itself a non-capturing subpattern. For example, |
| 3602 |
|
consider this pattern: |
| 3603 |
|
|
| 3604 |
|
(?|(Sat)ur|(Sun))day |
| 3605 |
|
|
| 3606 |
|
Because the two alternatives are inside a (?| group, both sets of cap- |
| 3607 |
|
turing parentheses are numbered one. Thus, when the pattern matches, |
| 3608 |
|
you can look at captured substring number one, whichever alternative |
| 3609 |
|
matched. This construct is useful when you want to capture part, but |
| 3610 |
|
not all, of one of a number of alternatives. Inside a (?| group, paren- |
| 3611 |
|
theses are numbered as usual, but the number is reset at the start of |
| 3612 |
|
each branch. The numbers of any capturing buffers that follow the sub- |
| 3613 |
|
pattern start after the highest number used in any branch. The follow- |
| 3614 |
|
ing example is taken from the Perl documentation. The numbers under- |
| 3615 |
|
neath show in which buffer the captured content will be stored. |
| 3616 |
|
|
| 3617 |
|
# before ---------------branch-reset----------- after |
| 3618 |
|
/ ( a ) (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x |
| 3619 |
|
# 1 2 2 3 2 3 4 |
| 3620 |
|
|
| 3621 |
|
A backreference or a recursive call to a numbered subpattern always |
| 3622 |
|
refers to the first one in the pattern with the given number. |
| 3623 |
|
|
| 3624 |
|
An alternative approach to using this "branch reset" feature is to use |
| 3625 |
|
duplicate named subpatterns, as described in the next section. |
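
The numbering rule for (?| groups described in this section can be sketched as a small scanner over the pattern text. The C function below is a hypothetical illustration (number_groups is not a PCRE function): it treats every "(?" group other than "(?|" as non-capturing, so named groups are not handled, and it does not parse character classes. On each "|" that belongs directly to a (?| group the counter is reset to its value at the start of the group, and at the closing parenthesis it continues from the highest number used in any branch.

```c
#include <stddef.h>

#define MAX_DEPTH 64

/* Hypothetical helper, not part of PCRE: stores the number assigned to each
   capturing parenthesis, in pattern order, in nums[] (at most max_nums
   entries) and returns how many capturing parentheses were seen, or -1 if
   the parentheses are unbalanced or nested too deeply. */
int number_groups(const char *pat, int *nums, int max_nums)
{
    struct { int reset, start, max; } stack[MAX_DEPTH];
    int depth = 0, n = 0, count = 0;
    size_t i;

    for (i = 0; pat[i] != '\0'; i++) {
        char c = pat[i];
        if (c == '\\') {
            if (pat[i + 1] != '\0') i++;        /* skip escaped character */
        } else if (c == '(') {
            if (depth == MAX_DEPTH) return -1;
            stack[depth].reset = 0;
            stack[depth].start = stack[depth].max = n;
            if (pat[i + 1] == '?') {
                if (pat[i + 2] == '|') {        /* branch reset group */
                    stack[depth].reset = 1;
                    i += 2;
                }
            } else {                            /* plain capturing group */
                n++;
                if (count < max_nums) nums[count] = n;
                count++;
            }
            depth++;
        } else if (c == '|') {
            if (depth > 0 && stack[depth - 1].reset) {
                if (n > stack[depth - 1].max) stack[depth - 1].max = n;
                n = stack[depth - 1].start;     /* restart numbering */
            }
        } else if (c == ')') {
            if (depth == 0) return -1;
            depth--;
            if (stack[depth].reset) {
                if (n > stack[depth].max) stack[depth].max = n;
                n = stack[depth].max;           /* continue after highest */
            }
        }
    }
    return depth == 0 ? count : -1;
}
```

Applied to the example from the Perl documentation with the spaces removed, (a)(?|x(y)z|(p(q)r)|(t)u(v))(z), the function reports the numbers 1 2 2 3 2 3 4, which agrees with the buffer numbers shown above.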
| 3626 |
|
|
| 3627 |
|
|
| 3628 |
NAMED SUBPATTERNS |
NAMED SUBPATTERNS |
| 3629 |
|
|
| 3630 |
Identifying capturing parentheses by number is simple, but it can be |
Identifying capturing parentheses by number is simple, but it can be |
| 3664 |
(?<DN>Sat)(?:urday)? |
(?<DN>Sat)(?:urday)? |
| 3665 |
|
|
| 3666 |
There are five capturing substrings, but only one is ever set after a |
There are five capturing substrings, but only one is ever set after a |
| 3667 |
match. The convenience function for extracting the data by name |
match. (An alternative way of solving this problem is to use a "branch |
| 3668 |
returns the substring for the first (and in this example, the only) |
reset" subpattern, as described in the previous section.) |
| 3669 |
subpattern of that name that matched. This saves searching to find |
|
| 3670 |
which numbered subpattern it was. If you make a reference to a non- |
The convenience function for extracting the data by name returns the |
| 3671 |
unique named subpattern from elsewhere in the pattern, the one that |
substring for the first (and in this example, the only) subpattern of |
| 3672 |
corresponds to the lowest number is used. For further details of the |
that name that matched. This saves searching to find which numbered |
| 3673 |
interfaces for handling named subpatterns, see the pcreapi documenta- |
subpattern it was. If you make a reference to a non-unique named sub- |
| 3674 |
tion. |
pattern from elsewhere in the pattern, the one that corresponds to the |
| 3675 |
|
lowest number is used. For further details of the interfaces for han- |
| 3676 |
|
dling named subpatterns, see the pcreapi documentation. |
| 3677 |
|
|
| 3678 |
|
|
| 3679 |
REPETITION |
REPETITION |
| 4545 |
|
|
| 4546 |
REVISION |
REVISION |
| 4547 |
|
|
| 4548 |
Last updated: 29 May 2007 |
Last updated: 19 June 2007 |
| 4549 |
Copyright (c) 1997-2007 University of Cambridge. |
Copyright (c) 1997-2007 University of Cambridge. |
| 4550 |
------------------------------------------------------------------------------ |
------------------------------------------------------------------------------ |
| 4551 |
|
|
| 4876 |
|
|
| 4877 |
COMPATIBILITY WITH DIFFERENT PCRE RELEASES |
COMPATIBILITY WITH DIFFERENT PCRE RELEASES |
| 4878 |
|
|
| 4879 |
The layout of the control block that is at the start of the data that |
In general, it is safest to recompile all saved patterns when you |
| 4880 |
makes up a compiled pattern was changed for release 5.0. If you have |
update to a new PCRE release, though not all updates actually require |
| 4881 |
any saved patterns that were compiled with previous releases (not a |
this. Recompiling is definitely needed for release 7.2. |
|
facility that was previously advertised), you will have to recompile |
|
|
them for release 5.0 and above. |
|
|
|
|
|
If you have any saved patterns in UTF-8 mode that use \p or \P that |
|
|
were compiled with any release up to and including 6.4, you will have |
|
|
to recompile them for release 6.5 and above. |
|
|
|
|
|
All saved patterns from earlier releases must be recompiled for release |
|
|
7.0 or higher, because there was an internal reorganization at that |
|
|
release. |
|
| 4882 |
|
|
| 4883 |
|
|
| 4884 |
AUTHOR |
AUTHOR |
| 4890 |
|
|
| 4891 |
REVISION |
REVISION |
| 4892 |
|
|
| 4893 |
Last updated: 24 April 2007 |
Last updated: 13 June 2007 |
| 4894 |
Copyright (c) 1997-2007 University of Cambridge. |
Copyright (c) 1997-2007 University of Cambridge. |
| 4895 |
------------------------------------------------------------------------------ |
------------------------------------------------------------------------------ |
| 4896 |
|
|
| 5625 |
bility of matching an empty string. Comments in the code explain what |
bility of matching an empty string. Comments in the code explain what |
| 5626 |
is going on. |
is going on. |
| 5627 |
|
|
| 5628 |
If PCRE is installed in the standard include and library directories |
The demonstration program is automatically built if you use "./config- |
| 5629 |
for your system, you should be able to compile the demonstration pro- |
ure;make" to build PCRE. Otherwise, if PCRE is installed in the stan- |
| 5630 |
gram using this command: |
dard include and library directories for your system, you should be |
| 5631 |
|
able to compile the demonstration program using this command: |
| 5632 |
|
|
| 5633 |
gcc -o pcredemo pcredemo.c -lpcre |
gcc -o pcredemo pcredemo.c -lpcre |
| 5634 |
|
|
| 5635 |
If PCRE is installed elsewhere, you may need to add additional options |
If PCRE is installed elsewhere, you may need to add additional options |
| 5636 |
to the command line. For example, on a Unix-like system that has PCRE |
to the command line. For example, on a Unix-like system that has PCRE |
| 5637 |
installed in /usr/local, you can compile the demonstration program |
installed in /usr/local, you can compile the demonstration program |
| 5638 |
using a command like this: |
using a command like this: |
| 5639 |
|
|
| 5640 |
gcc -o pcredemo -I/usr/local/include pcredemo.c \ |
gcc -o pcredemo -I/usr/local/include pcredemo.c \ |
| 5641 |
-L/usr/local/lib -lpcre |
-L/usr/local/lib -lpcre |
| 5642 |
|
|
| 5643 |
Once you have compiled the demonstration program, you can run simple |
Once you have compiled the demonstration program, you can run simple |
| 5644 |
tests like this: |
tests like this: |
| 5645 |
|
|
| 5646 |
./pcredemo 'cat|dog' 'the cat sat on the mat' |
./pcredemo 'cat|dog' 'the cat sat on the mat' |
| 5647 |
./pcredemo -g 'cat|dog' 'the dog sat on the cat' |
./pcredemo -g 'cat|dog' 'the dog sat on the cat' |
| 5648 |
|
|
| 5649 |
Note that there is a much more comprehensive test program, called |
Note that there is a much more comprehensive test program, called |
| 5650 |
pcretest, which supports many more facilities for testing regular |
pcretest, which supports many more facilities for testing regular |
| 5651 |
expressions and the PCRE library. The pcredemo program is provided as a |
expressions and the PCRE library. The pcredemo program is provided as a |
| 5652 |
simple coding example. |
simple coding example. |
| 5653 |
|
|
| 5655 |
the standard library directory, you may get an error like this when you |
the standard library directory, you may get an error like this when you |
| 5656 |
try to run pcredemo: |
try to run pcredemo: |
| 5657 |
|
|
| 5658 |
ld.so.1: a.out: fatal: libpcre.so.0: open failed: No such file or |
ld.so.1: a.out: fatal: libpcre.so.0: open failed: No such file or |
| 5659 |
directory |
directory |
| 5660 |
|
|
| 5661 |
This is caused by the way shared library support works on those sys- |
This is caused by the way shared library support works on those sys- |
| 5662 |
tems. You need to add |
tems. You need to add |
| 5663 |
|
|
| 5664 |
-R/usr/local/lib |
-R/usr/local/lib |
| 5675 |
|
|
| 5676 |
REVISION |
REVISION |
| 5677 |
|
|
| 5678 |
Last updated: 06 March 2007 |
Last updated: 13 June 2007 |
| 5679 |
Copyright (c) 1997-2007 University of Cambridge. |
Copyright (c) 1997-2007 University of Cambridge. |
| 5680 |
------------------------------------------------------------------------------ |
------------------------------------------------------------------------------ |
| 5681 |
PCRESTACK(3) PCRESTACK(3) |
PCRESTACK(3) PCRESTACK(3) |
| 5745 |
In environments where stack memory is constrained, you might want to |
In environments where stack memory is constrained, you might want to |
| 5746 |
compile PCRE to use heap memory instead of stack for remembering back- |
compile PCRE to use heap memory instead of stack for remembering back- |
| 5747 |
up points. This makes it run a lot more slowly, however. Details of how |
up points. This makes it run a lot more slowly, however. Details of how |
| 5748 |
to do this are given in the pcrebuild documentation. |
to do this are given in the pcrebuild documentation. When built in this |
| 5749 |
|
way, instead of using the stack, PCRE obtains and frees memory by call- |
| 5750 |
In Unix-like environments, there is not often a problem with the stack |
ing the functions that are pointed to by the pcre_stack_malloc and |
| 5751 |
unless very long strings are involved, though the default limit on |
pcre_stack_free variables. By default, these point to malloc() and |
| 5752 |
stack size varies from system to system. Values from 8Mb to 64Mb are |
free(), but you can replace the pointers to cause PCRE to use your own |
| 5753 |
|
functions. Since the block sizes are always the same, and are always |
| 5754 |
|
freed in reverse order, it may be possible to implement customized mem- |
| 5755 |
|
ory handlers that are more efficient than the standard functions. |
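The remark about same-sized blocks freed in reverse order suggests a simple stack-like pool. The following is a minimal sketch of that idea, not code from PCRE: the names frame_alloc, frame_free, FRAME_POOL_SLOTS and FRAME_POOL_BYTES are hypothetical, and the pool sizes are arbitrary. It assumes a single thread, and it falls back to malloc() and free() when a request does not fit the pool.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define FRAME_POOL_SLOTS 64    /* hypothetical pool depth */
#define FRAME_POOL_BYTES 1024  /* hypothetical upper bound on one block */

/* A union keeps every slot suitably aligned for common object types. */
typedef union {
    long double ld;
    long long ll;
    void *ptr;
    unsigned char bytes[FRAME_POOL_BYTES];
} frame_slot;

static frame_slot frame_pool[FRAME_POOL_SLOTS];
static size_t frame_pool_top = 0;  /* number of slots currently handed out */

/* Same signature as malloc(): hand out the next free slot, or fall back
 * to malloc() if the request is too big or the pool is exhausted. */
void *frame_alloc(size_t size)
{
    if (size <= FRAME_POOL_BYTES && frame_pool_top < FRAME_POOL_SLOTS)
        return frame_pool[frame_pool_top++].bytes;
    return malloc(size);
}

/* Same signature as free(): pop the pool if the block came from it,
 * otherwise pass the block on to free(). */
void frame_free(void *block)
{
    uintptr_t p = (uintptr_t)block;
    uintptr_t lo = (uintptr_t)frame_pool;
    uintptr_t hi = lo + sizeof frame_pool;

    if (block == NULL)
        return;
    if (p >= lo && p < hi) {
        /* Blocks are freed in reverse order, so this must be the top slot. */
        assert(frame_pool_top > 0);
        assert(block == (void *)frame_pool[frame_pool_top - 1].bytes);
        frame_pool_top--;
    } else {
        free(block);
    }
}
```

In a program that includes pcre.h and links against a PCRE built in this way, the two functions could be installed by assigning them to the pcre_stack_malloc and pcre_stack_free variables before matching. Whether this is actually faster than the standard functions depends on the system and should be measured.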
| 5756 |
|
|
| 5757 |
|
In Unix-like environments, there is not often a problem with the stack |
| 5758 |
|
unless very long strings are involved, though the default limit on |
| 5759 |
|
stack size varies from system to system. Values from 8Mb to 64Mb are |
| 5760 |
common. You can find your default limit by running the command: |
common. You can find your default limit by running the command: |
| 5761 |
|
|
| 5762 |
ulimit -s |
ulimit -s |
| 5763 |
|
|
| 5764 |
Unfortunately, the effect of running out of stack is often SIGSEGV, |
Unfortunately, the effect of running out of stack is often SIGSEGV, |
| 5765 |
though sometimes a more explicit error message is given. You can nor- |
though sometimes a more explicit error message is given. You can nor- |
| 5766 |
mally increase the limit on stack size by code such as this: |
mally increase the limit on stack size by code such as this: |
| 5767 |
|
|
| 5768 |
struct rlimit rlim; |
struct rlimit rlim; |
| 5769 |
getrlimit(RLIMIT_STACK, &rlim); |
getrlimit(RLIMIT_STACK, &rlim); |
| 5770 |
rlim.rlim_cur = 100*1024*1024; |
rlim.rlim_cur = 100*1024*1024; |
| 5771 |
setrlimit(RLIMIT_STACK, &rlim); |
setrlimit(RLIMIT_STACK, &rlim); |
| 5772 |
|
|
| 5773 |
This reads the current limits (soft and hard) using getrlimit(), then |
This reads the current limits (soft and hard) using getrlimit(), then |
| 5774 |
attempts to increase the soft limit to 100Mb using setrlimit(). You |
attempts to increase the soft limit to 100Mb using setrlimit(). You |
| 5775 |
must do this before calling pcre_exec(). |
must do this before calling pcre_exec(). |
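The fragment above does no error checking. A slightly fuller sketch follows, assuming a POSIX system that provides <sys/resource.h>. The function name raise_stack_limit is hypothetical and is not part of PCRE.

```c
#include <sys/resource.h>

/* Try to raise the soft stack limit to `wanted` bytes, or to the hard
 * limit if that is lower. Returns 0 on success and -1 on failure, in
 * which case errno has been set by getrlimit() or setrlimit(). */
int raise_stack_limit(rlim_t wanted)
{
    struct rlimit rlim;

    /* Read the current soft (rlim_cur) and hard (rlim_max) limits. */
    if (getrlimit(RLIMIT_STACK, &rlim) != 0)
        return -1;

    /* Nothing to do if the soft limit is already large enough. */
    if (rlim.rlim_cur == RLIM_INFINITY || rlim.rlim_cur >= wanted)
        return 0;

    /* An unprivileged process cannot raise the soft limit above the
     * hard limit, so clamp the request. */
    if (rlim.rlim_max != RLIM_INFINITY && wanted > rlim.rlim_max)
        wanted = rlim.rlim_max;

    rlim.rlim_cur = wanted;
    return setrlimit(RLIMIT_STACK, &rlim);
}
```

A call such as raise_stack_limit((rlim_t)100 * 1024 * 1024) would be made once, before the first call to pcre_exec(). Because the request is clamped to the hard limit, a return value of 0 does not guarantee that the full amount was granted; call getrlimit() again if you need to know the resulting value.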
| 5776 |
|
|
| 5777 |
PCRE has an internal counter that can be used to limit the depth of |
PCRE has an internal counter that can be used to limit the depth of |
| 5778 |
recursion, and thus cause pcre_exec() to give an error code before it |
recursion, and thus cause pcre_exec() to give an error code before it |
| 5779 |
runs out of stack. By default, the limit is very large, and unlikely |
runs out of stack. By default, the limit is very large, and unlikely |
| 5780 |
ever to operate. It can be changed when PCRE is built, and it can also |
ever to operate. It can be changed when PCRE is built, and it can also |
| 5781 |
be set when pcre_exec() is called. For details of these interfaces, see |
be set when pcre_exec() is called. For details of these interfaces, see |
| 5782 |
the pcrebuild and pcreapi documentation. |
the pcrebuild and pcreapi documentation. |
| 5783 |
|
|
| 5784 |
As a very rough rule of thumb, you should reckon on about 500 bytes per |
As a very rough rule of thumb, you should reckon on about 500 bytes per |
| 5785 |
recursion. Thus, if you want to limit your stack usage to 8Mb, you |
recursion. Thus, if you want to limit your stack usage to 8Mb, you |
| 5786 |
should set the limit at 16000 recursions. A 64Mb stack, on the other |
should set the limit at 16000 recursions. A 64Mb stack, on the other |
| 5787 |
hand, can support around 128000 recursions. The pcretest test program |
hand, can support around 128000 recursions. The pcretest test program |
| 5788 |
has a command line option (-S) that can be used to increase the size of |
has a command line option (-S) that can be used to increase the size of |
| 5789 |
its stack. |
its stack. |
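The arithmetic behind these figures can be sketched as below. The function name recursion_limit_for_stack and the macro BYTES_PER_RECURSION are hypothetical, and the 500-byte figure is only the rough rule of thumb given above. Note that the figures of 16000 and 128000 correspond to reading 8Mb and 64Mb as 8,000,000 and 64,000,000 bytes; with binary megabytes the results are a little higher.

```c
/* Rough stack cost of one recursion, from the rule of thumb above. */
#define BYTES_PER_RECURSION 500UL

/* Estimate how many recursions fit in a stack budget of `stack_bytes`. */
unsigned long recursion_limit_for_stack(unsigned long stack_bytes)
{
    return stack_bytes / BYTES_PER_RECURSION;
}
```

For example, recursion_limit_for_stack(8UL * 1000 * 1000) gives 16000, and recursion_limit_for_stack(8UL * 1024 * 1024) gives 16777. The result is the kind of value you would supply as the recursion limit when calling pcre_exec(); the pcreapi documentation describes the field to use.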
| 5790 |
|
|
| 5798 |
|
|
| 5799 |
REVISION |
REVISION |
| 5800 |
|
|
| 5801 |
Last updated: 12 March 2007 |
Last updated: 05 June 2007 |
| 5802 |
Copyright (c) 1997-2007 University of Cambridge. |
Copyright (c) 1997-2007 University of Cambridge. |
| 5803 |
------------------------------------------------------------------------------ |
------------------------------------------------------------------------------ |
| 5804 |
|
|