| 28 |
mately with Perl 5.10, including support for UTF-8 encoded strings and |
mately with Perl 5.10, including support for UTF-8 encoded strings and |
| 29 |
Unicode general category properties. However, UTF-8 and Unicode support |
Unicode general category properties. However, UTF-8 and Unicode support |
| 30 |
has to be explicitly enabled; it is not the default. The Unicode tables |
has to be explicitly enabled; it is not the default. The Unicode tables |
| 31 |
correspond to Unicode release 5.0.0. |
correspond to Unicode release 5.1. |
| 32 |
|
|
| 33 |
In addition to the Perl-compatible matching function, PCRE contains an |
In addition to the Perl-compatible matching function, PCRE contains an |
| 34 |
alternative matching function that matches the same compiled patterns |
alternative matching function that matches the same compiled patterns |
| 136 |
|
|
| 137 |
In order to process UTF-8 strings, you must build PCRE to include UTF-8 |
In order to process UTF-8 strings, you must build PCRE to include UTF-8 |
| 138 |
support in the code, and, in addition, you must call pcre_compile() |
support in the code, and, in addition, you must call pcre_compile() |
| 139 |
with the PCRE_UTF8 option flag. When you do this, both the pattern and |
with the PCRE_UTF8 option flag, or the pattern must start with the |
| 140 |
any subject strings that are matched against it are treated as UTF-8 |
sequence (*UTF8). When either of these is the case, both the pattern |
| 141 |
strings instead of just strings of bytes. |
and any subject strings that are matched against it are treated as |
| 142 |
|
UTF-8 strings instead of just strings of bytes. |
| 143 |
|
|
| 144 |
If you compile PCRE with UTF-8 support, but do not use it at run time, |
If you compile PCRE with UTF-8 support, but do not use it at run time, |
| 145 |
the library will be a bit bigger, but the additional run time overhead |
the library will be a bit bigger, but the additional run time overhead |
| 146 |
is limited to testing the PCRE_UTF8 flag occasionally, so should not be |
is limited to testing the PCRE_UTF8 flag occasionally, so should not be |
| 147 |
very big. |
very big. |
| 148 |
|
|
| 149 |
If PCRE is built with Unicode character property support (which implies |
If PCRE is built with Unicode character property support (which implies |
| 150 |
UTF-8 support), the escape sequences \p{..}, \P{..}, and \X are sup- |
UTF-8 support), the escape sequences \p{..}, \P{..}, and \X are sup- |
| 151 |
ported. The available properties that can be tested are limited to the |
ported. The available properties that can be tested are limited to the |
| 152 |
general category properties such as Lu for an upper case letter or Nd |
general category properties such as Lu for an upper case letter or Nd |
| 153 |
for a decimal number, the Unicode script names such as Arabic or Han, |
for a decimal number, the Unicode script names such as Arabic or Han, |
| 154 |
and the derived properties Any and L&. A full list is given in the |
and the derived properties Any and L&. A full list is given in the |
| 155 |
pcrepattern documentation. Only the short names for properties are sup- |
pcrepattern documentation. Only the short names for properties are sup- |
| 156 |
ported. For example, \p{L} matches a letter. Its Perl synonym, \p{Let- |
ported. For example, \p{L} matches a letter. Its Perl synonym, \p{Let- |
| 157 |
ter}, is not supported. Furthermore, in Perl, many properties may |
ter}, is not supported. Furthermore, in Perl, many properties may |
| 158 |
optionally be prefixed by "Is", for compatibility with Perl 5.6. PCRE |
optionally be prefixed by "Is", for compatibility with Perl 5.6. PCRE |
| 159 |
does not support this. |
does not support this. |
| 160 |
|
|
| 161 |
Validity of UTF-8 strings |
Validity of UTF-8 strings |
| 162 |
|
|
| 163 |
When you set the PCRE_UTF8 flag, the strings passed as patterns and |
When you set the PCRE_UTF8 flag, the strings passed as patterns and |
| 164 |
subjects are (by default) checked for validity on entry to the relevant |
subjects are (by default) checked for validity on entry to the relevant |
| 165 |
functions. From release 7.3 of PCRE, the check is according to the rules |
functions. From release 7.3 of PCRE, the check is according to the rules |
| 166 |
of RFC 3629, which are themselves derived from the Unicode specifica- |
of RFC 3629, which are themselves derived from the Unicode specifica- |
| 167 |
tion. Earlier releases of PCRE followed the rules of RFC 2279, which |
tion. Earlier releases of PCRE followed the rules of RFC 2279, which |
| 168 |
allows the full range of 31-bit values (0 to 0x7FFFFFFF). The current |
allows the full range of 31-bit values (0 to 0x7FFFFFFF). The current |
| 169 |
check allows only values in the range U+0 to U+10FFFF, excluding U+D800 |
check allows only values in the range U+0 to U+10FFFF, excluding U+D800 |
| 170 |
to U+DFFF. |
to U+DFFF. |
| 171 |
|
|
| 172 |
The excluded code points are the "Low Surrogate Area" of Unicode, of |
The excluded code points are the "Low Surrogate Area" of Unicode, of |
| 173 |
which the Unicode Standard says this: "The Low Surrogate Area does not |
which the Unicode Standard says this: "The Low Surrogate Area does not |
| 174 |
contain any character assignments, consequently no character code |
contain any character assignments, consequently no character code |
| 175 |
charts or namelists are provided for this area. Surrogates are reserved |
charts or namelists are provided for this area. Surrogates are reserved |
| 176 |
for use with UTF-16 and then must be used in pairs." The code points |
for use with UTF-16 and then must be used in pairs." The code points |
| 177 |
that are encoded by UTF-16 pairs are available as independent code |
that are encoded by UTF-16 pairs are available as independent code |
| 178 |
points in the UTF-8 encoding. (In other words, the whole surrogate |
points in the UTF-8 encoding. (In other words, the whole surrogate |
| 179 |
thing is a fudge for UTF-16 which unfortunately messes up UTF-8.) |
thing is a fudge for UTF-16 which unfortunately messes up UTF-8.) |
| 180 |
|
|
| 181 |
If an invalid UTF-8 string is passed to PCRE, an error return |
If an invalid UTF-8 string is passed to PCRE, an error return |
| 182 |
(PCRE_ERROR_BADUTF8) is given. In some situations, you may already know |
(PCRE_ERROR_BADUTF8) is given. In some situations, you may already know |
| 183 |
that your strings are valid, and therefore want to skip these checks in |
that your strings are valid, and therefore want to skip these checks in |
| 184 |
order to improve performance. If you set the PCRE_NO_UTF8_CHECK flag at |
order to improve performance. If you set the PCRE_NO_UTF8_CHECK flag at |
| 185 |
compile time or at run time, PCRE assumes that the pattern or subject |
compile time or at run time, PCRE assumes that the pattern or subject |
| 186 |
it is given (respectively) contains only valid UTF-8 codes. In this |
it is given (respectively) contains only valid UTF-8 codes. In this |
| 187 |
case, it does not diagnose an invalid UTF-8 string. |
case, it does not diagnose an invalid UTF-8 string. |
| 188 |
|
|
| 189 |
If you pass an invalid UTF-8 string when PCRE_NO_UTF8_CHECK is set, |
If you pass an invalid UTF-8 string when PCRE_NO_UTF8_CHECK is set, |
| 190 |
what happens depends on why the string is invalid. If the string con- |
what happens depends on why the string is invalid. If the string con- |
| 191 |
forms to the "old" definition of UTF-8 (RFC 2279), it is processed as a |
forms to the "old" definition of UTF-8 (RFC 2279), it is processed as a |
| 192 |
string of characters in the range 0 to 0x7FFFFFFF. In other words, |
string of characters in the range 0 to 0x7FFFFFFF. In other words, |
| 193 |
apart from the initial validity test, PCRE (when in UTF-8 mode) handles |
apart from the initial validity test, PCRE (when in UTF-8 mode) handles |
| 194 |
strings according to the more liberal rules of RFC 2279. However, if |
strings according to the more liberal rules of RFC 2279. However, if |
| 195 |
the string does not even conform to RFC 2279, the result is undefined. |
the string does not even conform to RFC 2279, the result is undefined. |
| 196 |
Your program may crash. |
Your program may crash. |
| 197 |
|
|
| 198 |
If you want to process strings of values in the full range 0 to |
If you want to process strings of values in the full range 0 to |
| 199 |
0x7FFFFFFF, encoded in a UTF-8-like manner as per the old RFC, you can |
0x7FFFFFFF, encoded in a UTF-8-like manner as per the old RFC, you can |
| 200 |
set PCRE_NO_UTF8_CHECK to bypass the more restrictive test. However, in |
set PCRE_NO_UTF8_CHECK to bypass the more restrictive test. However, in |
| 201 |
this situation, you will have to apply your own validity check. |
this situation, you will have to apply your own validity check. |
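
As an illustration of the kind of validity check an application might apply itself, here is a minimal sketch in C of a test that follows the RFC 3629 limits described above (code points U+0 to U+10FFFF, excluding U+D800 to U+DFFF). The function name utf8_is_valid_rfc3629 is hypothetical and is not part of the PCRE API. This is not PCRE's own checking code, and the rejection of overlong encodings is an assumption of this sketch. A caller working to the older RFC 2279 rules would need to relax the range tests and accept 5-byte and 6-byte forms.

```c
#include <stddef.h>

/* Hypothetical helper, not part of PCRE: returns 1 if buf[0..len) is
   well-formed UTF-8 under the RFC 3629 rules, and 0 otherwise. */
int utf8_is_valid_rfc3629(const unsigned char *buf, size_t len)
{
    size_t i = 0;

    while (i < len) {
        unsigned char lead = buf[i];
        unsigned long cp;    /* code point being decoded */
        unsigned long min;   /* smallest value allowed for this length */
        size_t extra;        /* number of continuation bytes expected */
        size_t k;

        if (lead < 0x80) {   /* single-byte (ASCII) character */
            i++;
            continue;
        } else if ((lead & 0xE0) == 0xC0) {
            cp = lead & 0x1F; extra = 1; min = 0x80;
        } else if ((lead & 0xF0) == 0xE0) {
            cp = lead & 0x0F; extra = 2; min = 0x800;
        } else if ((lead & 0xF8) == 0xF0) {
            cp = lead & 0x07; extra = 3; min = 0x10000;
        } else {
            return 0;        /* stray continuation byte, or a 5/6-byte
                                lead byte that only RFC 2279 allowed */
        }

        if (extra > len - i - 1)
            return 0;        /* sequence is truncated */

        for (k = 1; k <= extra; k++) {
            unsigned char c = buf[i + k];
            if ((c & 0xC0) != 0x80)
                return 0;    /* not a continuation byte */
            cp = (cp << 6) | (unsigned long)(c & 0x3F);
        }

        if (cp < min)
            return 0;        /* overlong encoding (assumption: rejected) */
        if (cp > 0x10FFFF)
            return 0;        /* beyond the RFC 3629 range */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;        /* surrogate code point */

        i += extra + 1;
    }
    return 1;
}
```

A program could run a check of this kind once on each subject string and then pass PCRE_NO_UTF8_CHECK when matching, so that the string is not validated a second time.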
| 202 |
|
|
| 203 |
General comments about UTF-8 mode |
General comments about UTF-8 mode |
| 204 |
|
|
| 205 |
1. An unbraced hexadecimal escape sequence (such as \xb3) matches a |
1. An unbraced hexadecimal escape sequence (such as \xb3) matches a |
| 206 |
two-byte UTF-8 character if the value is greater than 127. |
two-byte UTF-8 character if the value is greater than 127. |
| 207 |
|
|
| 208 |
2. Octal numbers up to \777 are recognized, and match two-byte UTF-8 |
2. Octal numbers up to \777 are recognized, and match two-byte UTF-8 |
| 209 |
characters for values greater than \177. |
characters for values greater than \177. |
| 210 |
|
|
| 211 |
3. Repeat quantifiers apply to complete UTF-8 characters, not to indi- |
3. Repeat quantifiers apply to complete UTF-8 characters, not to indi- |
| 212 |
vidual bytes, for example: \x{100}{3}. |
vidual bytes, for example: \x{100}{3}. |
| 213 |
|
|
| 214 |
4. The dot metacharacter matches one UTF-8 character instead of a sin- |
4. The dot metacharacter matches one UTF-8 character instead of a sin- |
| 215 |
gle byte. |
gle byte. |
| 216 |
|
|
| 217 |
5. The escape sequence \C can be used to match a single byte in UTF-8 |
5. The escape sequence \C can be used to match a single byte in UTF-8 |
| 218 |
mode, but its use can lead to some strange effects. This facility is |
mode, but its use can lead to some strange effects. This facility is |
| 219 |
not available in the alternative matching function, pcre_dfa_exec(). |
not available in the alternative matching function, pcre_dfa_exec(). |
| 220 |
|
|
| 221 |
6. The character escapes \b, \B, \d, \D, \s, \S, \w, and \W correctly |
6. The character escapes \b, \B, \d, \D, \s, \S, \w, and \W correctly |
| 222 |
test characters of any code value, but the characters that PCRE recog- |
test characters of any code value, but the characters that PCRE recog- |
| 223 |
nizes as digits, spaces, or word characters remain the same set as |
nizes as digits, spaces, or word characters remain the same set as |
| 224 |
before, all with values less than 256. This remains true even when PCRE |
before, all with values less than 256. This remains true even when PCRE |
| 225 |
includes Unicode property support, because to do otherwise would slow |
includes Unicode property support, because to do otherwise would slow |
| 226 |
down PCRE in many common cases. If you really want to test for a wider |
down PCRE in many common cases. If you really want to test for a wider |
| 227 |
sense of, say, "digit", you must use Unicode property tests such as |
sense of, say, "digit", you must use Unicode property tests such as |
| 228 |
\p{Nd}. Note that this also applies to \b, because it is defined in |
\p{Nd}. Note that this also applies to \b, because it is defined in |
| 229 |
terms of \w and \W. |
terms of \w and \W. |
| 230 |
|
|
| 231 |
7. Similarly, characters that match the POSIX named character classes |
7. Similarly, characters that match the POSIX named character classes |
| 232 |
are all low-valued characters. |
are all low-valued characters. |
| 233 |
|
|
| 234 |
8. However, the Perl 5.10 horizontal and vertical whitespace matching |
8. However, the Perl 5.10 horizontal and vertical whitespace matching |
| 235 |
escapes (\h, \H, \v, and \V) do match all the appropriate Unicode char- |
escapes (\h, \H, \v, and \V) do match all the appropriate Unicode char- |
| 236 |
acters. |
acters. |
| 237 |
|
|
| 238 |
9. Case-insensitive matching applies only to characters whose values |
9. Case-insensitive matching applies only to characters whose values |
| 239 |
are less than 128, unless PCRE is built with Unicode property support. |
are less than 128, unless PCRE is built with Unicode property support. |
| 240 |
Even when Unicode property support is available, PCRE still uses its |
Even when Unicode property support is available, PCRE still uses its |
| 241 |
own character tables when checking the case of low-valued characters, |
own character tables when checking the case of low-valued characters, |
| 242 |
so as not to degrade performance. The Unicode property information is |
so as not to degrade performance. The Unicode property information is |
| 243 |
used only for characters with higher values. Even when Unicode property |
used only for characters with higher values. Even when Unicode property |
| 244 |
support is available, PCRE supports case-insensitive matching only when |
support is available, PCRE supports case-insensitive matching only when |
| 245 |
there is a one-to-one mapping between a letter's cases. There are a |
there is a one-to-one mapping between a letter's cases. There are a |
| 246 |
small number of many-to-one mappings in Unicode; these are not sup- |
small number of many-to-one mappings in Unicode; these are not sup- |
| 247 |
ported by PCRE. |
ported by PCRE. |
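
Items 1 and 2 above follow directly from the UTF-8 encoding rules: any value from 128 to 0x7FF occupies two bytes, which covers every unbraced \xhh escape above 127 and every octal escape from \200 to \777. The sketch below, in C, shows the arithmetic. The function utf8_encode is a hypothetical helper written for this illustration. It is not a PCRE function and says nothing about how PCRE encodes characters internally.

```c
#include <stddef.h>

/* Hypothetical helper, not part of PCRE: encodes the code point cp as
   UTF-8 into out and returns the number of bytes written. Returns 0 for
   surrogates and for values above U+10FFFF. */
size_t utf8_encode(unsigned long cp, unsigned char out[4])
{
    if (cp < 0x80) {                      /* one byte: 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                     /* two bytes: 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                   /* three bytes */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                     /* surrogates are not encodable */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                 /* four bytes */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

For example, the value 0xb3 is encoded as the two bytes 0xC2 0xB3, and octal 777 (0x1FF) as 0xC7 0xBF, so in UTF-8 mode the pattern item \xb3 has to match both bytes in the subject rather than the single byte 0xB3.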
| 248 |
|
|
| 249 |
|
|
| 253 |
University Computing Service |
University Computing Service |
| 254 |
Cambridge CB2 3QH, England. |
Cambridge CB2 3QH, England. |
| 255 |
|
|
| 256 |
Putting an actual email address here seems to have been a spam magnet, |
Putting an actual email address here seems to have been a spam magnet, |
| 257 |
so I've taken it away. If you want to email me, use my two initials, |
so I've taken it away. If you want to email me, use my two initials, |
| 258 |
followed by the two digits 10, at the domain cam.ac.uk. |
followed by the two digits 10, at the domain cam.ac.uk. |
| 259 |
|
|
| 260 |
|
|
| 261 |
REVISION |
REVISION |
| 262 |
|
|
| 263 |
Last updated: 18 March 2009 |
Last updated: 11 April 2009 |
| 264 |
Copyright (c) 1997-2009 University of Cambridge. |
Copyright (c) 1997-2009 University of Cambridge. |
| 265 |
------------------------------------------------------------------------------ |
------------------------------------------------------------------------------ |
| 266 |
|
|
| 1134 |
|
|
| 1135 |
The options argument contains various bit settings that affect the com- |
The options argument contains various bit settings that affect the com- |
| 1136 |
pilation. It should be zero if no options are required. The available |
pilation. It should be zero if no options are required. The available |
| 1137 |
options are described below. Some of them, in particular, those that |
options are described below. Some of them (in particular, those that |
| 1138 |
are compatible with Perl, can also be set and unset from within the |
are compatible with Perl, but also some others) can also be set and |
| 1139 |
pattern (see the detailed description in the pcrepattern documenta- |
unset from within the pattern (see the detailed description in the |
| 1140 |
tion). For these options, the contents of the options argument speci- |
pcrepattern documentation). For those options that can be different in |
| 1141 |
fies their initial settings at the start of compilation and execution. |
different parts of the pattern, the contents of the options argument |
| 1142 |
The PCRE_ANCHORED and PCRE_NEWLINE_xxx options can be set at the time |
specifies their initial settings at the start of compilation and execu- |
| 1143 |
of matching as well as at compile time. |
tion. The PCRE_ANCHORED and PCRE_NEWLINE_xxx options can be set at the |
| 1144 |
|
time of matching as well as at compile time. |
| 1145 |
|
|
| 1146 |
If errptr is NULL, pcre_compile() returns NULL immediately. Otherwise, |
If errptr is NULL, pcre_compile() returns NULL immediately. Otherwise, |
| 1147 |
if compilation of a pattern fails, pcre_compile() returns NULL, and |
if compilation of a pattern fails, pcre_compile() returns NULL, and |
| 1148 |
sets the variable pointed to by errptr to point to a textual error mes- |
sets the variable pointed to by errptr to point to a textual error mes- |
| 1149 |
sage. This is a static string that is part of the library. You must not |
sage. This is a static string that is part of the library. You must not |
| 1150 |
try to free it. The offset from the start of the pattern to the charac- |
try to free it. The offset from the start of the pattern to the charac- |
| 1151 |
ter where the error was discovered is placed in the variable pointed to |
ter where the error was discovered is placed in the variable pointed to |
| 1152 |
by erroffset, which must not be NULL. If it is, an immediate error is |
by erroffset, which must not be NULL. If it is, an immediate error is |
| 1153 |
given. |
given. |
| 1154 |
|
|
| 1155 |
If pcre_compile2() is used instead of pcre_compile(), and the error- |
If pcre_compile2() is used instead of pcre_compile(), and the error- |
| 1156 |
codeptr argument is not NULL, a non-zero error code number is returned |
codeptr argument is not NULL, a non-zero error code number is returned |
| 1157 |
via this argument in the event of an error. This is in addition to the |
via this argument in the event of an error. This is in addition to the |
| 1158 |
textual error message. Error codes and messages are listed below. |
textual error message. Error codes and messages are listed below. |
| 1159 |
|
|
| 1160 |
If the final argument, tableptr, is NULL, PCRE uses a default set of |
If the final argument, tableptr, is NULL, PCRE uses a default set of |
| 1161 |
character tables that are built when PCRE is compiled, using the |
character tables that are built when PCRE is compiled, using the |
| 1162 |
default C locale. Otherwise, tableptr must be an address that is the |
default C locale. Otherwise, tableptr must be an address that is the |
| 1163 |
result of a call to pcre_maketables(). This value is stored with the |
result of a call to pcre_maketables(). This value is stored with the |
| 1164 |
compiled pattern, and used again by pcre_exec(), unless another table |
compiled pattern, and used again by pcre_exec(), unless another table |
| 1165 |
pointer is passed to it. For more discussion, see the section on locale |
pointer is passed to it. For more discussion, see the section on locale |
| 1166 |
support below. |
support below. |
| 1167 |
|
|
| 1168 |
This code fragment shows a typical straightforward call to pcre_com- |
This code fragment shows a typical straightforward call to pcre_com- |
| 1169 |
pile(): |
pile(): |
| 1170 |
|
|
| 1171 |
pcre *re; |
pcre *re; |
| 1178 |
&erroffset, /* for error offset */ |
&erroffset, /* for error offset */ |
| 1179 |
NULL); /* use default character tables */ |
NULL); /* use default character tables */ |
| 1180 |
|
|
| 1181 |
The following names for option bits are defined in the pcre.h header |
The following names for option bits are defined in the pcre.h header |
| 1182 |
file: |
file: |
| 1183 |
|
|
| 1184 |
PCRE_ANCHORED |
PCRE_ANCHORED |
| 1185 |
|
|
| 1186 |
If this bit is set, the pattern is forced to be "anchored", that is, it |
If this bit is set, the pattern is forced to be "anchored", that is, it |
| 1187 |
is constrained to match only at the first matching point in the string |
is constrained to match only at the first matching point in the string |
| 1188 |
that is being searched (the "subject string"). This effect can also be |
that is being searched (the "subject string"). This effect can also be |
| 1189 |
achieved by appropriate constructs in the pattern itself, which is the |
achieved by appropriate constructs in the pattern itself, which is the |
| 1190 |
only way to do it in Perl. |
only way to do it in Perl. |
| 1191 |
|
|
| 1192 |
PCRE_AUTO_CALLOUT |
PCRE_AUTO_CALLOUT |
| 1193 |
|
|
| 1194 |
If this bit is set, pcre_compile() automatically inserts callout items, |
If this bit is set, pcre_compile() automatically inserts callout items, |
| 1195 |
all with number 255, before each pattern item. For discussion of the |
all with number 255, before each pattern item. For discussion of the |
| 1196 |
callout facility, see the pcrecallout documentation. |
callout facility, see the pcrecallout documentation. |
| 1197 |
|
|
| 1198 |
PCRE_BSR_ANYCRLF |
PCRE_BSR_ANYCRLF |
| 1199 |
PCRE_BSR_UNICODE |
PCRE_BSR_UNICODE |
| 1200 |
|
|
| 1201 |
These options (which are mutually exclusive) control what the \R escape |
These options (which are mutually exclusive) control what the \R escape |
| 1202 |
sequence matches. The choice is either to match only CR, LF, or CRLF, |
sequence matches. The choice is either to match only CR, LF, or CRLF, |
| 1203 |
or to match any Unicode newline sequence. The default is specified when |
or to match any Unicode newline sequence. The default is specified when |
| 1204 |
PCRE is built. It can be overridden from within the pattern, or by set- |
PCRE is built. It can be overridden from within the pattern, or by set- |
| 1205 |
ting an option when a compiled pattern is matched. |
ting an option when a compiled pattern is matched. |
| 1206 |
|
|
| 1207 |
PCRE_CASELESS |
PCRE_CASELESS |
| 1208 |
|
|
| 1209 |
If this bit is set, letters in the pattern match both upper and lower |
If this bit is set, letters in the pattern match both upper and lower |
| 1210 |
case letters. It is equivalent to Perl's /i option, and it can be |
case letters. It is equivalent to Perl's /i option, and it can be |
| 1211 |
changed within a pattern by a (?i) option setting. In UTF-8 mode, PCRE |
changed within a pattern by a (?i) option setting. In UTF-8 mode, PCRE |
| 1212 |
always understands the concept of case for characters whose values are |
always understands the concept of case for characters whose values are |
| 1213 |
less than 128, so caseless matching is always possible. For characters |
less than 128, so caseless matching is always possible. For characters |
| 1214 |
with higher values, the concept of case is supported if PCRE is com- |
with higher values, the concept of case is supported if PCRE is com- |
| 1215 |
piled with Unicode property support, but not otherwise. If you want to |
piled with Unicode property support, but not otherwise. If you want to |
| 1216 |
use caseless matching for characters 128 and above, you must ensure |
use caseless matching for characters 128 and above, you must ensure |
| 1217 |
that PCRE is compiled with Unicode property support as well as with |
that PCRE is compiled with Unicode property support as well as with |
| 1218 |
UTF-8 support. |
UTF-8 support. |
| 1219 |
|
|
| 1220 |
PCRE_DOLLAR_ENDONLY |
PCRE_DOLLAR_ENDONLY |
| 1221 |
|
|
| 1222 |
If this bit is set, a dollar metacharacter in the pattern matches only |
If this bit is set, a dollar metacharacter in the pattern matches only |
| 1223 |
at the end of the subject string. Without this option, a dollar also |
at the end of the subject string. Without this option, a dollar also |
| 1224 |
matches immediately before a newline at the end of the string (but not |
matches immediately before a newline at the end of the string (but not |
| 1225 |
before any other newlines). The PCRE_DOLLAR_ENDONLY option is ignored |
before any other newlines). The PCRE_DOLLAR_ENDONLY option is ignored |
| 1226 |
if PCRE_MULTILINE is set. There is no equivalent to this option in |
if PCRE_MULTILINE is set. There is no equivalent to this option in |
| 1227 |
Perl, and no way to set it within a pattern. |
Perl, and no way to set it within a pattern. |
| 1228 |
|
|
| 1229 |
PCRE_DOTALL |
PCRE_DOTALL |
| 1230 |
|
|
| 1231 |
If this bit is set, a dot metacharacter in the pattern matches all char- |
If this bit is set, a dot metacharacter in the pattern matches all char- |
| 1232 |
acters, including those that indicate newline. Without it, a dot does |
acters, including those that indicate newline. Without it, a dot does |
| 1233 |
not match when the current position is at a newline. This option is |
not match when the current position is at a newline. This option is |
| 1234 |
equivalent to Perl's /s option, and it can be changed within a pattern |
equivalent to Perl's /s option, and it can be changed within a pattern |
| 1235 |
by a (?s) option setting. A negative class such as [^a] always matches |
by a (?s) option setting. A negative class such as [^a] always matches |
| 1236 |
newline characters, independent of the setting of this option. |
newline characters, independent of the setting of this option. |
| 1237 |
|
|
| 1238 |
PCRE_DUPNAMES |
PCRE_DUPNAMES |
| 1239 |
|
|
| 1240 |
If this bit is set, names used to identify capturing subpatterns need |
If this bit is set, names used to identify capturing subpatterns need |
| 1241 |
not be unique. This can be helpful for certain types of pattern when it |
not be unique. This can be helpful for certain types of pattern when it |
| 1242 |
is known that only one instance of the named subpattern can ever be |
is known that only one instance of the named subpattern can ever be |
| 1243 |
matched. There are more details of named subpatterns below; see also |
matched. There are more details of named subpatterns below; see also |
| 1244 |
the pcrepattern documentation. |
the pcrepattern documentation. |
| 1245 |
|
|
| 1246 |
PCRE_EXTENDED |
PCRE_EXTENDED |
| 1247 |
|
|
| 1248 |
If this bit is set, whitespace data characters in the pattern are |
If this bit is set, whitespace data characters in the pattern are |
| 1249 |
totally ignored except when escaped or inside a character class. White- |
totally ignored except when escaped or inside a character class. White- |
| 1250 |
space does not include the VT character (code 11). In addition, charac- |
space does not include the VT character (code 11). In addition, charac- |
| 1251 |
ters between an unescaped # outside a character class and the next new- |
ters between an unescaped # outside a character class and the next new- |
| 1252 |
line, inclusive, are also ignored. This is equivalent to Perl's /x |
line, inclusive, are also ignored. This is equivalent to Perl's /x |
| 1253 |
option, and it can be changed within a pattern by a (?x) option set- |
option, and it can be changed within a pattern by a (?x) option set- |
| 1254 |
ting. |
ting. |
| 1255 |
|
|
| 1256 |
This option makes it possible to include comments inside complicated |
This option makes it possible to include comments inside complicated |
| 1257 |
patterns. Note, however, that this applies only to data characters. |
patterns. Note, however, that this applies only to data characters. |
| 1258 |
Whitespace characters may never appear within special character |
Whitespace characters may never appear within special character |
| 1259 |
sequences in a pattern, for example within the sequence (?( which |
sequences in a pattern, for example within the sequence (?( which |
| 1260 |
introduces a conditional subpattern. |
introduces a conditional subpattern. |
| 1261 |
|
|
| 1262 |
PCRE_EXTRA |
PCRE_EXTRA |
| 1263 |
|
|
| 1264 |
This option was invented in order to turn on additional functionality |
This option was invented in order to turn on additional functionality |
| 1265 |
of PCRE that is incompatible with Perl, but it is currently of very |
of PCRE that is incompatible with Perl, but it is currently of very |
| 1266 |
little use. When set, any backslash in a pattern that is followed by a |
little use. When set, any backslash in a pattern that is followed by a |
| 1267 |
letter that has no special meaning causes an error, thus reserving |
letter that has no special meaning causes an error, thus reserving |
| 1268 |
these combinations for future expansion. By default, as in Perl, a |
these combinations for future expansion. By default, as in Perl, a |
| 1269 |
backslash followed by a letter with no special meaning is treated as a |
backslash followed by a letter with no special meaning is treated as a |
| 1270 |
literal. (Perl can, however, be persuaded to give a warning for this.) |
literal. (Perl can, however, be persuaded to give a warning for this.) |
| 1271 |
There are at present no other features controlled by this option. It |
There are at present no other features controlled by this option. It |
| 1272 |
can also be set by a (?X) option setting within a pattern. |
can also be set by a (?X) option setting within a pattern. |
| 1273 |
|
|
| 1274 |
PCRE_FIRSTLINE |
PCRE_FIRSTLINE |
| 1275 |
|
|
| 1276 |
If this option is set, an unanchored pattern is required to match |
If this option is set, an unanchored pattern is required to match |
| 1277 |
before or at the first newline in the subject string, though the |
before or at the first newline in the subject string, though the |
| 1278 |
matched text may continue over the newline. |
matched text may continue over the newline. |
| 1279 |
|
|
| 1280 |
PCRE_JAVASCRIPT_COMPAT |
PCRE_JAVASCRIPT_COMPAT |
| 1281 |
|
|
| 1282 |
If this option is set, PCRE's behaviour is changed in some ways so that |
If this option is set, PCRE's behaviour is changed in some ways so that |
| 1283 |
it is compatible with JavaScript rather than Perl. The changes are as |
it is compatible with JavaScript rather than Perl. The changes are as |
| 1284 |
follows: |
follows: |
| 1285 |
|
|
| 1286 |
(1) A lone closing square bracket in a pattern causes a compile-time |
(1) A lone closing square bracket in a pattern causes a compile-time |
| 1287 |
error, because this is illegal in JavaScript (by default it is treated |
error, because this is illegal in JavaScript (by default it is treated |
| 1288 |
as a data character). Thus, the pattern AB]CD becomes illegal when this |
as a data character). Thus, the pattern AB]CD becomes illegal when this |
| 1289 |
option is set. |
option is set. |
| 1290 |
|
|
| 1291 |
(2) At run time, a back reference to an unset subpattern group matches |
(2) At run time, a back reference to an unset subpattern group matches |
| 1292 |
an empty string (by default this causes the current matching alterna- |
an empty string (by default this causes the current matching alterna- |
| 1293 |
tive to fail). A pattern such as (\1)(a) succeeds when this option is |
tive to fail). A pattern such as (\1)(a) succeeds when this option is |
| 1294 |
set (assuming it can find an "a" in the subject), whereas it fails by |
set (assuming it can find an "a" in the subject), whereas it fails by |
| 1295 |
default, for Perl compatibility. |
default, for Perl compatibility. |
| 1296 |
|
|
| 1297 |
PCRE_MULTILINE |
PCRE_MULTILINE |
| 1298 |
|
|
| 1299 |
By default, PCRE treats the subject string as consisting of a single |
By default, PCRE treats the subject string as consisting of a single |
| 1300 |
line of characters (even if it actually contains newlines). The "start |
line of characters (even if it actually contains newlines). The "start |
| 1301 |
of line" metacharacter (^) matches only at the start of the string, |
of line" metacharacter (^) matches only at the start of the string, |
| 1302 |
while the "end of line" metacharacter ($) matches only at the end of |
while the "end of line" metacharacter ($) matches only at the end of |
| 1303 |
the string, or before a terminating newline (unless PCRE_DOLLAR_ENDONLY |
the string, or before a terminating newline (unless PCRE_DOLLAR_ENDONLY |
| 1304 |
is set). This is the same as Perl. |
is set). This is the same as Perl. |
| 1305 |
|
|
| 1306 |
When PCRE_MULTILINE is set, the "start of line" and "end of line" |
When PCRE_MULTILINE is set, the "start of line" and "end of line" |
| 1307 |
constructs match immediately following or immediately before internal |
constructs match immediately following or immediately before internal |
| 1308 |
newlines in the subject string, respectively, as well as at the very |
newlines in the subject string, respectively, as well as at the very |
| 1309 |
start and end. This is equivalent to Perl's /m option, and it can be |
start and end. This is equivalent to Perl's /m option, and it can be |
| 1310 |
changed within a pattern by a (?m) option setting. If there are no new- |
changed within a pattern by a (?m) option setting. If there are no new- |
| 1311 |
lines in a subject string, or no occurrences of ^ or $ in a pattern, |
lines in a subject string, or no occurrences of ^ or $ in a pattern, |
| 1312 |
setting PCRE_MULTILINE has no effect. |
setting PCRE_MULTILINE has no effect. |
| 1313 |
|
|
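As a rough illustration of the multiline rule, the following sketch lists
the offsets at which "^" could match when PCRE_MULTILINE is set. It is not
PCRE code: the function name multiline_line_starts() is hypothetical, and
the sketch assumes that a newline is a single LF character, which is only
one of the newline conventions PCRE supports.

```c
#include <stddef.h>
#include <string.h>

/* Collect the offsets at which a multiline "^" could match, assuming a
 * newline is a single LF (an assumption of this sketch). At most max
 * offsets are stored in out; the number stored is returned. */
size_t multiline_line_starts(const char *subject, size_t *out, size_t max)
{
    size_t len = strlen(subject);
    size_t n = 0;

    if (max > 0)
        out[n++] = 0;                 /* the very start of the subject */

    /* i + 1 < len skips a newline that is the last character: it is a
     * terminating newline, not an internal one. */
    for (size_t i = 0; i + 1 < len && n < max; i++) {
        if (subject[i] == '\n')
            out[n++] = i + 1;         /* position just after the newline */
    }
    return n;
}
```

For the subject "ab\ncd\n" this stores the offsets 0 and 3. The position
after the final newline is not stored, because that newline terminates the
subject rather than being internal to it.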
PCRE_NEWLINE_CR
PCRE_NEWLINE_LF
PCRE_NEWLINE_CRLF
PCRE_NEWLINE_ANYCRLF
PCRE_NEWLINE_ANY

These options override the default newline definition that was chosen
when PCRE was built. Setting the first or the second specifies that a
newline is indicated by a single character (CR or LF, respectively).
Setting PCRE_NEWLINE_CRLF specifies that a newline is indicated by the
two-character CRLF sequence. Setting PCRE_NEWLINE_ANYCRLF specifies
that any of the three preceding sequences should be recognized. Setting
PCRE_NEWLINE_ANY specifies that any Unicode newline sequence should be
recognized. The Unicode newline sequences are the three just mentioned,
plus the single characters VT (vertical tab, U+000B), FF (formfeed,
U+000C), NEL (next line, U+0085), LS (line separator, U+2028), and PS
(paragraph separator, U+2029). The last two are recognized only in
UTF-8 mode.

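The "any of CR, LF, or CRLF" convention can be pictured with a small
helper. This is only a sketch of the rule as described above, not the code
PCRE uses internally; the name anycrlf_newline_length() is hypothetical.

```c
#include <stddef.h>

/* Return the length of the newline sequence that starts at subject[pos]
 * under a CR / LF / CRLF convention, or 0 if no newline starts there. */
size_t anycrlf_newline_length(const char *subject, size_t len, size_t pos)
{
    if (pos >= len)
        return 0;
    if (subject[pos] == '\r') {
        if (pos + 1 < len && subject[pos + 1] == '\n')
            return 2;   /* CRLF is treated as one newline */
        return 1;       /* a lone CR */
    }
    if (subject[pos] == '\n')
        return 1;       /* a lone LF */
    return 0;
}
```

The point of checking for CRLF first is that the two-character sequence
must count as a single line break rather than as two.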
The newline setting in the options word uses three bits that are
treated as a number, giving eight possibilities. Currently only six are
used (default plus the five values above). This means that if you set
more than one newline option, the combination may or may not be sensi-
ble. For example, PCRE_NEWLINE_CR with PCRE_NEWLINE_LF is equivalent to
PCRE_NEWLINE_CRLF, but other combinations may yield unused numbers and
cause an error.

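The arithmetic behind this can be sketched as follows. The EX_ constants
are illustrative copies that are assumed to mirror the values in pcre.h;
real code should use the PCRE_NEWLINE_xxx names from the header instead.
The helper functions are hypothetical and are not part of the PCRE API.

```c
/* Assumed to mirror the pcre.h values; the EX_ names are hypothetical. */
#define EX_NEWLINE_CR       0x00100000
#define EX_NEWLINE_LF       0x00200000
#define EX_NEWLINE_CRLF     0x00300000
#define EX_NEWLINE_ANY      0x00400000
#define EX_NEWLINE_ANYCRLF  0x00500000
#define EX_NEWLINE_MASK     0x00700000   /* the three newline bits */

/* Extract the newline setting as a number in the range 0 to 7. */
int ex_newline_field(int options)
{
    return (options & EX_NEWLINE_MASK) >> 20;
}

/* Non-zero if the field is one of the six values in use: 0 (default)
 * or one of the five named settings (1 to 5). */
int ex_newline_is_known(int options)
{
    return ex_newline_field(options) <= 5;
}
```

With these values, CR combined with LF gives the same bits as CRLF, whereas
CRLF combined with ANY gives the unused number 7.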
The only time that a line break is specially recognized when compiling
a pattern is if PCRE_EXTENDED is set, and an unescaped # outside a
character class is encountered. This indicates a comment that lasts
until after the next line break sequence. In other circumstances, line
break sequences are treated as literal data, except that in
PCRE_EXTENDED mode, both CR and LF are treated as whitespace characters
and are therefore ignored.

PCRE_NO_AUTO_CAPTURE

If this option is set, it disables the use of numbered capturing paren-
theses in the pattern. Any opening parenthesis that is not followed by
? behaves as if it were followed by ?: but named parentheses can still
be used for capturing (and they acquire numbers in the usual way).
There is no equivalent of this option in Perl.

PCRE_UNGREEDY

This option inverts the "greediness" of the quantifiers so that they
are not greedy by default, but become greedy if followed by "?". It is
not compatible with Perl. It can also be set by a (?U) option setting
within the pattern.

PCRE_UTF8

This option causes PCRE to regard both the pattern and the subject as
strings of UTF-8 characters instead of single-byte character strings.
However, it is available only when PCRE is built to include UTF-8 sup-
port. If not, the use of this option provokes an error. Details of how
this option changes the behaviour of PCRE are given in the section on
UTF-8 support in the main pcre page.

PCRE_NO_UTF8_CHECK

When PCRE_UTF8 is set, the validity of the pattern as a UTF-8 string is
automatically checked. There is a discussion about the validity of
UTF-8 strings in the main pcre page. If an invalid UTF-8 sequence of
bytes is found, pcre_compile() returns an error. If you already know
that your pattern is valid, and you want to skip this check for perfor-
mance reasons, you can set the PCRE_NO_UTF8_CHECK option. When it is
set, the effect of passing an invalid UTF-8 string as a pattern is
undefined. It may cause your program to crash. Note that this option
can also be passed to pcre_exec() and pcre_dfa_exec(), to suppress the
UTF-8 validity checking of subject strings.


COMPILATION ERROR CODES

The following table lists the error codes that may be returned by
pcre_compile2(), along with the error messages that may be returned by
both compiling functions. As PCRE has developed, some error codes have
fallen out of use. To avoid confusion, they have not been re-used.

   0  no error
  ...
  50  [this code is not in use]
  51  octal value is greater than \377 (not in UTF-8 mode)
  52  internal error: overran compiling workspace
  53  internal error: previously-checked referenced subpattern not found
  54  DEFINE group contains more than one branch
  55  repeating a DEFINE group is not allowed
  ...
  63  digit expected after (?+
  64  ] is an invalid data character in JavaScript compatibility mode

The numbers 32 and 10000 in errors 48 and 49 are defaults; different
values may be used if the limits were changed when PCRE was built.


pcre_extra *pcre_study(const pcre *code, int options,
     const char **errptr);

If a compiled pattern is going to be used several times, it is worth
spending more time analyzing it in order to speed up the time taken for
matching. The function pcre_study() takes a pointer to a compiled pat-
tern as its first argument. If studying the pattern produces additional
information that will help speed up matching, pcre_study() returns a
pointer to a pcre_extra block, in which the study_data field points to
the results of the study.

The returned value from pcre_study() can be passed directly to
pcre_exec(). However, a pcre_extra block also contains other fields
that can be set by the caller before the block is passed; these are
described below in the section on matching a pattern.

If studying the pattern does not produce any additional information,
pcre_study() returns NULL. In that circumstance, if the calling program
wants to pass any of the other fields to pcre_exec(), it must set up
its own pcre_extra block.

The second argument of pcre_study() contains option bits. At present,
no options are defined, and this argument should always be zero.

The third argument for pcre_study() is a pointer for an error message.
If studying succeeds (even if no data is returned), the variable it
points to is set to NULL. Otherwise it is set to point to a textual
error message. This is a static string that is part of the library. You
must not try to free it. You should test the error pointer for NULL
after calling pcre_study(), to be sure that it has run successfully.

This is a typical call to pcre_study():

  ...
    &error);        /* set to NULL or points to a message */

At present, studying a pattern is useful only for non-anchored patterns
that do not have a single fixed starting character. A bitmap of possi-
ble starting bytes is created.


LOCALE SUPPORT

PCRE handles caseless matching, and determines whether characters are
letters, digits, or whatever, by reference to a set of tables, indexed
by character value. When running in UTF-8 mode, this applies only to
characters with codes less than 128. Higher-valued codes never match
escapes such as \w or \d, but can be tested with \p if PCRE is built
with Unicode character property support. The use of locales with Uni-
code is discouraged. If you are handling characters with codes of 128
or greater, you should either use UTF-8 and Unicode, or use locales,
but not try to mix the two.

PCRE contains an internal set of tables that are used when the final
argument of pcre_compile() is NULL. These are sufficient for many
applications. Normally, the internal tables recognize only ASCII char-
acters. However, when PCRE is built, it is possible to cause the inter-
nal tables to be rebuilt in the default "C" locale of the local system,
which may cause them to be different.

The internal tables can always be overridden by tables supplied by the
application that calls PCRE. These may be created in a different locale
from the default. As more and more applications change to using Uni-
code, the need for this locale support is expected to die away.

External tables are built by calling the pcre_maketables() function,
which has no arguments, in the relevant locale. The result can then be
passed to pcre_compile() or pcre_exec() as often as necessary. For
example, to build and use tables that are appropriate for the French
locale (where accented characters with values greater than 128 are
treated as letters), the following code could be used:

  setlocale(LC_CTYPE, "fr_FR");
  tables = pcre_maketables();
  re = pcre_compile(..., tables);

The locale name "fr_FR" is used on Linux and other Unix-like systems;
if you are using Windows, the name for the French locale is "french".

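Why the tables depend on the locale can be seen with a much simpler table
than the ones pcre_maketables() builds. The sketch below uses only the
standard C library; build_letter_table_in() is a hypothetical helper, and
its one-byte-per-character layout is an assumption made for illustration,
not the layout of the real PCRE tables.

```c
#include <ctype.h>
#include <locale.h>

/* Switch LC_CTYPE to locale_name and record, for every byte value,
 * whether that locale classifies it as a letter. Returns 1 on success,
 * or 0 if the requested locale is not available. */
int build_letter_table_in(const char *locale_name, unsigned char table[256])
{
    if (setlocale(LC_CTYPE, locale_name) == NULL)
        return 0;                        /* locale is not installed */

    for (int c = 0; c < 256; c++)
        table[c] = isalpha(c) ? 1 : 0;   /* the answer is locale-dependent */
    return 1;
}
```

In the "C" locale only the ASCII letters are marked. In a single-byte
French locale, where one is installed, accented characters with high byte
values would normally be marked as well, which is the difference that
locale-specific tables are meant to capture.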
When pcre_maketables() runs, the tables are built in memory that is
obtained via pcre_malloc. It is the caller's responsibility to ensure
that the memory containing the tables remains available for as long as
it is needed.

The pointer that is passed to pcre_compile() is saved with the compiled
pattern, and the same tables are used via this pointer by pcre_study()
and normally also by pcre_exec(). Thus, by default, for any single pat-
tern, compilation, studying and matching all happen in the same locale,
but different patterns can be compiled in different locales.

It is possible to pass a table pointer or NULL (indicating the use of
the internal tables) to pcre_exec(). Although not intended for this
purpose, this facility could be used to match a pattern in a different
locale from the one in which it was compiled. Passing table pointers at
run time is discussed below in the section on matching a pattern.

int pcre_fullinfo(const pcre *code, const pcre_extra *extra,
     int what, void *where);

The pcre_fullinfo() function returns information about a compiled pat-
tern. It replaces the obsolete pcre_info() function, which is neverthe-
less retained for backwards compatibility (and is documented below).

The first argument for pcre_fullinfo() is a pointer to the compiled
pattern. The second argument is the result of pcre_study(), or NULL if
the pattern was not studied. The third argument specifies which piece
of information is required, and the fourth argument is a pointer to a
variable to receive the data. The yield of the function is zero for
success, or one of the following negative numbers:

  PCRE_ERROR_NULL       the argument code was NULL
  PCRE_ERROR_BADMAGIC   the "magic number" was not found
  PCRE_ERROR_BADOPTION  the value of what was invalid

The "magic number" is placed at the start of each compiled pattern as
a simple check against passing an arbitrary memory pointer. Here is a
typical call of pcre_fullinfo(), to obtain the length of the compiled
pattern:

  int rc;
  ...
    PCRE_INFO_SIZE,   /* what is required */
    &length);         /* where to put the data */

The possible values for the third argument are defined in pcre.h, and
are as follows:

PCRE_INFO_BACKREFMAX

Return the number of the highest back reference in the pattern. The
fourth argument should point to an int variable. Zero is returned if
there are no back references.

PCRE_INFO_CAPTURECOUNT

Return the number of capturing subpatterns in the pattern. The fourth
argument should point to an int variable.

PCRE_INFO_DEFAULT_TABLES

Return a pointer to the internal default character tables within PCRE.
The fourth argument should point to an unsigned char * variable. This
information call is provided for internal use by the pcre_study() func-
tion. External callers can cause PCRE to use its internal tables by
passing a NULL table pointer.

PCRE_INFO_FIRSTBYTE

Return information about the first byte of any matched string, for a
non-anchored pattern. The fourth argument should point to an int vari-
able. (This option used to be called PCRE_INFO_FIRSTCHAR; the old name
is still recognized for backwards compatibility.)

If there is a fixed first byte, for example, from a pattern such as
(cat|cow|coyote), its value is returned. Otherwise, if either

(a) the pattern was compiled with the PCRE_MULTILINE option, and every
branch starts with "^", or

(b) every branch of the pattern starts with ".*" and PCRE_DOTALL is not
set (if it were set, the pattern would be anchored),

-1 is returned, indicating that the pattern matches only at the start
of a subject string or after any newline within the string. Otherwise
-2 is returned. For anchored patterns, -2 is returned.

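A caller might fold the three kinds of result described above into one
small helper. The enum and the function name below are hypothetical and
are not part of the PCRE API; the sketch only encodes the rules just given
for the value that PCRE_INFO_FIRSTBYTE stores.

```c
/* Hypothetical classification of a PCRE_INFO_FIRSTBYTE result. */
enum first_byte_kind {
    FIRST_BYTE_FIXED,       /* value >= 0: every match starts with this byte */
    FIRST_BYTE_LINE_START,  /* value == -1: matches only at a line start */
    FIRST_BYTE_NONE         /* value == -2: nothing recorded, or anchored */
};

enum first_byte_kind classify_first_byte(int value)
{
    if (value >= 0)
        return FIRST_BYTE_FIXED;
    if (value == -1)
        return FIRST_BYTE_LINE_START;
    return FIRST_BYTE_NONE;
}
```

For a pattern such as (cat|cow|coyote) the stored value would be the byte
"c", which this helper classifies as FIRST_BYTE_FIXED.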
PCRE_INFO_FIRSTTABLE

If the pattern was studied, and this resulted in the construction of a
256-bit table indicating a fixed set of bytes for the first byte in any
matching string, a pointer to the table is returned. Otherwise NULL is
returned. The fourth argument should point to an unsigned char * vari-
able.

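A 256-bit table occupies 32 bytes, one bit per possible byte value. The
following sketch shows one way such a table could be set and tested. The
function names are hypothetical, and the bit order within each byte (least
significant bit first) is an assumption of this sketch rather than some-
thing stated above.

```c
/* Mark byte value c as a possible first byte (sketch; bit order assumed). */
void start_bits_set(unsigned char table[32], unsigned char c)
{
    table[c >> 3] |= (unsigned char)(1u << (c & 7));
}

/* Return 1 if byte value c is marked in the table, otherwise 0. */
int start_bits_test(const unsigned char table[32], unsigned char c)
{
    return (table[c >> 3] >> (c & 7)) & 1;
}
```

A matcher holding such a table can skip any subject position whose byte is
not marked, without attempting a full match at that position.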
PCRE_INFO_HASCRORLF

Return 1 if the pattern contains any explicit matches for CR or LF
characters, otherwise 0. The fourth argument should point to an int
variable. An explicit match is either a literal CR or LF character, or
\r or \n.

PCRE_INFO_JCHANGED

Return 1 if the (?J) or (?-J) option setting is used in the pattern,
otherwise 0. The fourth argument should point to an int variable. (?J)
and (?-J) set and unset the local PCRE_DUPNAMES option, respectively.

PCRE_INFO_LASTLITERAL

Return the value of the rightmost literal byte that must exist in any
matched string, other than at its start, if such a byte has been
recorded. The fourth argument should point to an int variable. If there
is no such byte, -1 is returned. For anchored patterns, a last literal
byte is recorded only if it follows something of variable length. For
example, for the pattern /^a\d+z\d+/ the returned value is "z", but for
/^a\dz\d/ the returned value is -1.

PCRE_INFO_NAMECOUNT
PCRE_INFO_NAMEENTRYSIZE
PCRE_INFO_NAMETABLE

| 1682 |
PCRE supports the use of named as well as numbered capturing parenthe- |
PCRE supports the use of named as well as numbered capturing parenthe- |
| 1683 |
ses. The names are just an additional way of identifying the parenthe- |
ses. The names are just an additional way of identifying the parenthe- |
| 1684 |
ses, which still acquire numbers. Several convenience functions such as |
ses, which still acquire numbers. Several convenience functions such as |
| 1685 |
pcre_get_named_substring() are provided for extracting captured sub- |
pcre_get_named_substring() are provided for extracting captured sub- |
| 1686 |
strings by name. It is also possible to extract the data directly, by |
strings by name. It is also possible to extract the data directly, by |
| 1687 |
first converting the name to a number in order to access the correct |
first converting the name to a number in order to access the correct |
| 1688 |
pointers in the output vector (described with pcre_exec() below). To do |
pointers in the output vector (described with pcre_exec() below). To do |
| 1689 |
the conversion, you need to use the name-to-number map, which is |
the conversion, you need to use the name-to-number map, which is |
| 1690 |
described by these three values. |
described by these three values. |
| 1691 |
|
|
| 1692 |
The map consists of a number of fixed-size entries. PCRE_INFO_NAMECOUNT |
The map consists of a number of fixed-size entries. PCRE_INFO_NAMECOUNT |
| 1693 |
gives the number of entries, and PCRE_INFO_NAMEENTRYSIZE gives the size |
gives the number of entries, and PCRE_INFO_NAMEENTRYSIZE gives the size |
| 1694 |
of each entry; both of these return an int value. The entry size |
of each entry; both of these return an int value. The entry size |
| 1695 |
depends on the length of the longest name. PCRE_INFO_NAMETABLE returns |
depends on the length of the longest name. PCRE_INFO_NAMETABLE returns |
| 1696 |
a pointer to the first entry of the table (a pointer to char). The |
a pointer to the first entry of the table (a pointer to char). The |
| 1697 |
first two bytes of each entry are the number of the capturing parenthe- |
first two bytes of each entry are the number of the capturing parenthe- |
| 1698 |
sis, most significant byte first. The rest of the entry is the corre- |
sis, most significant byte first. The rest of the entry is the corre- |
| 1699 |
sponding name, zero terminated. The names are in alphabetical order. |
sponding name, zero terminated. The names are in alphabetical order. |
| 1700 |
When PCRE_DUPNAMES is set, duplicate names are in order of their paren- |
When PCRE_DUPNAMES is set, duplicate names are in order of their paren- |
| 1701 |
theses numbers. For example, consider the following pattern (assume |
theses numbers. For example, consider the following pattern (assume |
| 1702 |
PCRE_EXTENDED is set, so white space - including newlines - is |
PCRE_EXTENDED is set, so white space - including newlines - is |
| 1703 |
ignored): |
ignored): |
| 1704 |
|
|
| 1705 |
(?<date> (?<year>(\d\d)?\d\d) - |
(?<date> (?<year>(\d\d)?\d\d) - |
| 1706 |
(?<month>\d\d) - (?<day>\d\d) ) |
(?<month>\d\d) - (?<day>\d\d) ) |
| 1707 |
|
|
| 1708 |
There are four named subpatterns, so the table has four entries, and |
There are four named subpatterns, so the table has four entries, and |
| 1709 |
each entry in the table is eight bytes long. The table is as follows, |
each entry in the table is eight bytes long. The table is as follows, |
| 1710 |
with non-printing bytes shown in hexadecimal, and undefined bytes shown |
with non-printing bytes shown in hexadecimal, and undefined bytes shown |
| 1711 |
as ??: |
as ??: |
| 1712 |
|
|
| 1715 |
00 04 m o n t h 00 |
00 04 m o n t h 00 |
| 1716 |
00 02 y e a r 00 ?? |
00 02 y e a r 00 ?? |
| 1717 |
|
|
| 1718 |
When writing code to extract data from named subpatterns using the |
When writing code to extract data from named subpatterns using the |
| 1719 |
name-to-number map, remember that the length of the entries is likely |
name-to-number map, remember that the length of the entries is likely |
| 1720 |
to be different for each compiled pattern. |
to be different for each compiled pattern. |
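
As a rough illustration of the table layout described above, the
following C sketch looks up a name in such a table and returns the
number of the capturing parenthesis. The function name
lookup_group_number and the array example_table are hypothetical. The
array is a hand-built copy of the table for the date pattern: the date
and day entries are derived from the pattern's group numbering, and the
undefined padding bytes are set to zero. In a real program the three
inputs would be obtained from pcre_fullinfo().

```c
#include <stddef.h>
#include <string.h>

/*
 * Look up a subpattern name in a name table laid out as described above
 * and return its capture number, or -1 if the name is not present.
 *
 *   table       first entry (the PCRE_INFO_NAMETABLE value, cast to
 *               unsigned char * so the two number bytes are not
 *               sign-extended)
 *   entry_size  bytes per entry (the PCRE_INFO_NAMEENTRYSIZE value)
 *   count       number of entries (the PCRE_INFO_NAMECOUNT value)
 */
static int lookup_group_number(const unsigned char *table, int entry_size,
                               int count, const char *name)
{
    int i;
    for (i = 0; i < count; i++) {
        const unsigned char *entry = table + (size_t)i * (size_t)entry_size;
        /* Bytes 0 and 1: group number, most significant byte first. */
        int number = (entry[0] << 8) | entry[1];
        /* Bytes 2 onwards: the zero-terminated name. */
        if (strcmp((const char *)(entry + 2), name) == 0)
            return number;
    }
    return -1;
}

/* Hypothetical stand-in for a real table: four entries of eight bytes. */
static const unsigned char example_table[4 * 8] = {
    0, 1, 'd', 'a', 't', 'e', 0, 0,
    0, 5, 'd', 'a', 'y', 0, 0, 0,
    0, 4, 'm', 'o', 'n', 't', 'h', 0,
    0, 2, 'y', 'e', 'a', 'r', 0, 0
};
```

The sketch scans the entries linearly and always takes the entry size
as a parameter, because that size differs between compiled patterns.
Since the names are in alphabetical order, a binary search would also
be possible for large tables.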
| 1721 |
|
|
| 1722 |
PCRE_INFO_OKPARTIAL |
PCRE_INFO_OKPARTIAL |
| 1723 |
|
|
| 1724 |
Return 1 if the pattern can be used for partial matching, otherwise 0. |
Return 1 if the pattern can be used for partial matching, otherwise 0. |
| 1725 |
The fourth argument should point to an int variable. The pcrepartial |
The fourth argument should point to an int variable. The pcrepartial |
| 1726 |
documentation lists the restrictions that apply to patterns when par- |
documentation lists the restrictions that apply to patterns when par- |
| 1727 |
tial matching is used. |
tial matching is used. |
| 1728 |
|
|
| 1729 |
PCRE_INFO_OPTIONS |
PCRE_INFO_OPTIONS |
| 1730 |
|
|
| 1731 |
Return a copy of the options with which the pattern was compiled. The |
Return a copy of the options with which the pattern was compiled. The |
| 1732 |
fourth argument should point to an unsigned long int variable. These |
fourth argument should point to an unsigned long int variable. These |
| 1733 |
option bits are those specified in the call to pcre_compile(), modified |
option bits are those specified in the call to pcre_compile(), modified |
| 1734 |
by any top-level option settings at the start of the pattern itself. In |
by any top-level option settings at the start of the pattern itself. In |
| 1735 |
other words, they are the options that will be in force when matching |
other words, they are the options that will be in force when matching |
| 1736 |
starts. For example, if the pattern /(?im)abc(?-i)d/ is compiled with |
starts. For example, if the pattern /(?im)abc(?-i)d/ is compiled with |
| 1737 |
the PCRE_EXTENDED option, the result is PCRE_CASELESS, PCRE_MULTILINE, |
the PCRE_EXTENDED option, the result is PCRE_CASELESS, PCRE_MULTILINE, |
| 1738 |
and PCRE_EXTENDED. |
and PCRE_EXTENDED. |
| 1739 |
|
|
| 1740 |
A pattern is automatically anchored by PCRE if all of its top-level |
A pattern is automatically anchored by PCRE if all of its top-level |
| 1741 |
alternatives begin with one of the following: |
alternatives begin with one of the following: |
| 1742 |
|
|
| 1743 |
^ unless PCRE_MULTILINE is set |
^ unless PCRE_MULTILINE is set |
| 1751 |
|
|
| 1752 |
PCRE_INFO_SIZE |
PCRE_INFO_SIZE |
| 1753 |
|
|
| 1754 |
Return the size of the compiled pattern, that is, the value that was |
Return the size of the compiled pattern, that is, the value that was |
| 1755 |
passed as the argument to pcre_malloc() when PCRE was getting memory in |
passed as the argument to pcre_malloc() when PCRE was getting memory in |
| 1756 |
which to place the compiled data. The fourth argument should point to a |
which to place the compiled data. The fourth argument should point to a |
| 1757 |
size_t variable. |
size_t variable. |
| 1759 |
PCRE_INFO_STUDYSIZE |
PCRE_INFO_STUDYSIZE |
| 1760 |
|
|
| 1761 |
Return the size of the data block pointed to by the study_data field in |
Return the size of the data block pointed to by the study_data field in |
| 1762 |
a pcre_extra block. That is, it is the value that was passed to |
a pcre_extra block. That is, it is the value that was passed to |
| 1763 |
pcre_malloc() when PCRE was getting memory into which to place the data |
pcre_malloc() when PCRE was getting memory into which to place the data |
| 1764 |
created by pcre_study(). The fourth argument should point to a size_t |
created by pcre_study(). The fourth argument should point to a size_t |
| 1765 |
variable. |
variable. |
| 1766 |
|
|
| 1767 |
|
|
| 1769 |
|
|
| 1770 |
int pcre_info(const pcre *code, int *optptr, int *firstcharptr); |
int pcre_info(const pcre *code, int *optptr, int *firstcharptr); |
| 1771 |
|
|
| 1772 |
The pcre_info() function is now obsolete because its interface is too |
The pcre_info() function is now obsolete because its interface is too |
| 1773 |
restrictive to return all the available data about a compiled pattern. |
restrictive to return all the available data about a compiled pattern. |
| 1774 |
New programs should use pcre_fullinfo() instead. The yield of |
New programs should use pcre_fullinfo() instead. The yield of |
| 1775 |
pcre_info() is the number of capturing subpatterns, or one of the fol- |
pcre_info() is the number of capturing subpatterns, or one of the fol- |
| 1776 |
lowing negative numbers: |
lowing negative numbers: |
| 1777 |
|
|
| 1778 |
PCRE_ERROR_NULL the argument code was NULL |
PCRE_ERROR_NULL the argument code was NULL |
| 1779 |
PCRE_ERROR_BADMAGIC the "magic number" was not found |
PCRE_ERROR_BADMAGIC the "magic number" was not found |
| 1780 |
|
|
| 1781 |
If the optptr argument is not NULL, a copy of the options with which |
If the optptr argument is not NULL, a copy of the options with which |
| 1782 |
the pattern was compiled is placed in the integer it points to (see |
the pattern was compiled is placed in the integer it points to (see |
| 1783 |
PCRE_INFO_OPTIONS above). |
PCRE_INFO_OPTIONS above). |
| 1784 |
|
|
| 1785 |
If the pattern is not anchored and the firstcharptr argument is not |
If the pattern is not anchored and the firstcharptr argument is not |
| 1786 |
NULL, it is used to pass back information about the first character of |
NULL, it is used to pass back information about the first character of |
| 1787 |
any matched string (see PCRE_INFO_FIRSTBYTE above). |
any matched string (see PCRE_INFO_FIRSTBYTE above). |
| 1788 |
|
|
| 1789 |
|
|
| 1791 |
|
|
| 1792 |
int pcre_refcount(pcre *code, int adjust); |
int pcre_refcount(pcre *code, int adjust); |
| 1793 |
|
|
| 1794 |
The pcre_refcount() function is used to maintain a reference count in |
The pcre_refcount() function is used to maintain a reference count in |
| 1795 |
the data block that contains a compiled pattern. It is provided for the |
the data block that contains a compiled pattern. It is provided for the |
| 1796 |
benefit of applications that operate in an object-oriented manner, |
benefit of applications that operate in an object-oriented manner, |
| 1797 |
where different parts of the application may be using the same compiled |
where different parts of the application may be using the same compiled |
| 1798 |
pattern, but you want to free the block when they are all done. |
pattern, but you want to free the block when they are all done. |
| 1799 |
|
|
| 1800 |
When a pattern is compiled, the reference count field is initialized to |
When a pattern is compiled, the reference count field is initialized to |
| 1801 |
zero. It is changed only by calling this function, whose action is to |
zero. It is changed only by calling this function, whose action is to |
| 1802 |
add the adjust value (which may be positive or negative) to it. The |
add the adjust value (which may be positive or negative) to it. The |
| 1803 |
yield of the function is the new value. However, the value of the count |
yield of the function is the new value. However, the value of the count |
| 1804 |
is constrained to lie between 0 and 65535, inclusive. If the new value |
is constrained to lie between 0 and 65535, inclusive. If the new value |
| 1805 |
is outside these limits, it is forced to the appropriate limit value. |
is outside these limits, it is forced to the appropriate limit value. |
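
The clamping rule above can be modelled in a few lines of C. This is
only a sketch of the documented arithmetic, not PCRE's own code, and
the function name adjust_refcount is hypothetical.

```c
/*
 * Model of the documented behaviour: add adjust to the current count
 * and force the result into the range 0..65535.
 */
static int adjust_refcount(int current, int adjust)
{
    /* Use a wider type so the addition itself cannot overflow. */
    long long value = (long long)current + (long long)adjust;
    if (value < 0)
        value = 0;
    if (value > 65535)
        value = 65535;
    return (int)value;
}
```

One consequence worth noting: because the count is forced to a limit
rather than wrapping, releasing more references than were taken leaves
the count at zero rather than producing a negative value.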
| 1806 |
|
|
| 1807 |
Except when it is zero, the reference count is not correctly preserved |
Except when it is zero, the reference count is not correctly preserved |
| 1808 |
if a pattern is compiled on one host and then transferred to a host |
if a pattern is compiled on one host and then transferred to a host |
| 1809 |
whose byte-order is different. (This seems a highly unlikely scenario.) |
whose byte-order is different. (This seems a highly unlikely scenario.) |
| 1810 |
|
|
| 1811 |
|
|
| 1815 |
const char *subject, int length, int startoffset, |
const char *subject, int length, int startoffset, |
| 1816 |
int options, int *ovector, int ovecsize); |
int options, int *ovector, int ovecsize); |
| 1817 |
|
|
| 1818 |
The function pcre_exec() is called to match a subject string against a |
The function pcre_exec() is called to match a subject string against a |
| 1819 |
compiled pattern, which is passed in the code argument. If the pattern |
compiled pattern, which is passed in the code argument. If the pattern |
| 1820 |
has been studied, the result of the study should be passed in the extra |
has been studied, the result of the study should be passed in the extra |
| 1821 |
argument. This function is the main matching facility of the library, |
argument. This function is the main matching facility of the library, |
| 1822 |
and it operates in a Perl-like manner. For specialist use there is also |
and it operates in a Perl-like manner. For specialist use there is also |
| 1823 |
an alternative matching function, which is described below in the sec- |
an alternative matching function, which is described below in the sec- |
| 1824 |
tion about the pcre_dfa_exec() function. |
tion about the pcre_dfa_exec() function. |
| 1825 |
|
|
| 1826 |
In most applications, the pattern will have been compiled (and option- |
In most applications, the pattern will have been compiled (and option- |
| 1827 |
ally studied) in the same process that calls pcre_exec(). However, it |
ally studied) in the same process that calls pcre_exec(). However, it |
| 1828 |
is possible to save compiled patterns and study data, and then use them |
is possible to save compiled patterns and study data, and then use them |
| 1829 |
later in different processes, possibly even on different hosts. For a |
later in different processes, possibly even on different hosts. For a |
| 1830 |
discussion about this, see the pcreprecompile documentation. |
discussion about this, see the pcreprecompile documentation. |
| 1831 |
|
|
| 1832 |
Here is an example of a simple call to pcre_exec(): |
Here is an example of a simple call to pcre_exec(): |
| 1845 |
|
|
| 1846 |
Extra data for pcre_exec() |
Extra data for pcre_exec() |
| 1847 |
|
|
| 1848 |
If the extra argument is not NULL, it must point to a pcre_extra data |
If the extra argument is not NULL, it must point to a pcre_extra data |
| 1849 |
block. The pcre_study() function returns such a block (when it doesn't |
block. The pcre_study() function returns such a block (when it doesn't |
| 1850 |
return NULL), but you can also create one for yourself, and pass addi- |
return NULL), but you can also create one for yourself, and pass addi- |
| 1851 |
tional information in it. The pcre_extra block contains the following |
tional information in it. The pcre_extra block contains the following |
| 1852 |
fields (not necessarily in this order): |
fields (not necessarily in this order): |
| 1853 |
|
|
| 1854 |
unsigned long int flags; |
unsigned long int flags; |
| 1858 |
void *callout_data; |
void *callout_data; |
| 1859 |
const unsigned char *tables; |
const unsigned char *tables; |
| 1860 |
|
|
| 1861 |
The flags field is a bitmap that specifies which of the other fields |
The flags field is a bitmap that specifies which of the other fields |
| 1862 |
are set. The flag bits are: |
are set. The flag bits are: |
| 1863 |
|
|
| 1864 |
PCRE_EXTRA_STUDY_DATA |
PCRE_EXTRA_STUDY_DATA |
| 1867 |
PCRE_EXTRA_CALLOUT_DATA |
PCRE_EXTRA_CALLOUT_DATA |
| 1868 |
PCRE_EXTRA_TABLES |
PCRE_EXTRA_TABLES |
| 1869 |
|
|
| 1870 |
Other flag bits should be set to zero. The study_data field is set in |
Other flag bits should be set to zero. The study_data field is set in |
| 1871 |
the pcre_extra block that is returned by pcre_study(), together with |
the pcre_extra block that is returned by pcre_study(), together with |
| 1872 |
the appropriate flag bit. You should not set this yourself, but you may |
the appropriate flag bit. You should not set this yourself, but you may |
| 1873 |
add to the block by setting the other fields and their corresponding |
add to the block by setting the other fields and their corresponding |
| 1874 |
flag bits. |
flag bits. |
| 1875 |
|
|
| 1876 |
The match_limit field provides a means of preventing PCRE from using up |
The match_limit field provides a means of preventing PCRE from using up |
| 1877 |
a vast amount of resources when running patterns that are not going to |
a vast amount of resources when running patterns that are not going to |
| 1878 |
match, but which have a very large number of possibilities in their |
match, but which have a very large number of possibilities in their |
| 1879 |
search trees. The classic example is the use of nested unlimited |
search trees. The classic example is the use of nested unlimited |
| 1880 |
repeats. |
repeats. |
| 1881 |
|
|
| 1882 |
Internally, PCRE uses a function called match() which it calls repeat- |
Internally, PCRE uses a function called match() which it calls repeat- |
| 1883 |
edly (sometimes recursively). The limit set by match_limit is imposed |
edly (sometimes recursively). The limit set by match_limit is imposed |
| 1884 |
on the number of times this function is called during a match, which |
on the number of times this function is called during a match, which |
| 1885 |
has the effect of limiting the amount of backtracking that can take |
has the effect of limiting the amount of backtracking that can take |
| 1886 |
place. For patterns that are not anchored, the count restarts from zero |
place. For patterns that are not anchored, the count restarts from zero |
| 1887 |
for each position in the subject string. |
for each position in the subject string. |
| 1888 |
|
|
| 1889 |
The default value for the limit can be set when PCRE is built; the |
The default value for the limit can be set when PCRE is built; the |
| 1890 |
built-in default is 10 million, which handles all but the most extreme |
built-in default is 10 million, which handles all but the most extreme |
| 1891 |
cases. You can override the default by supplying pcre_exec() with a |
cases. You can override the default by supplying pcre_exec() with a |
| 1892 |
pcre_extra block in which match_limit is set, and |
pcre_extra block in which match_limit is set, and |
| 1893 |
PCRE_EXTRA_MATCH_LIMIT is set in the flags field. If the limit is |
PCRE_EXTRA_MATCH_LIMIT is set in the flags field. If the limit is |
| 1894 |
exceeded, pcre_exec() returns PCRE_ERROR_MATCHLIMIT. |
exceeded, pcre_exec() returns PCRE_ERROR_MATCHLIMIT. |
| 1895 |
|
|
| 1896 |
The match_limit_recursion field is similar to match_limit, but instead |
The match_limit_recursion field is similar to match_limit, but instead |
| 1897 |
of limiting the total number of times that match() is called, it limits |
of limiting the total number of times that match() is called, it limits |
| 1898 |
the depth of recursion. The recursion depth is a smaller number than |
the depth of recursion. The recursion depth is a smaller number than |
| 1899 |
the total number of calls, because not all calls to match() are recur- |
the total number of calls, because not all calls to match() are recur- |
| 1900 |
sive. This limit is of use only if it is set smaller than match_limit. |
sive. This limit is of use only if it is set smaller than match_limit. |
| 1901 |
|
|
| 1902 |
Limiting the recursion depth limits the amount of stack that can be |
Limiting the recursion depth limits the amount of stack that can be |
| 1903 |
used, or, when PCRE has been compiled to use memory on the heap instead |
used, or, when PCRE has been compiled to use memory on the heap instead |
| 1904 |
of the stack, the amount of heap memory that can be used. |
of the stack, the amount of heap memory that can be used. |
| 1905 |
|
|
| 1906 |
The default value for match_limit_recursion can be set when PCRE is |
The default value for match_limit_recursion can be set when PCRE is |
| 1907 |
built; the built-in default is the same value as the default for |
built; the built-in default is the same value as the default for |
| 1908 |
match_limit. You can override the default by supplying pcre_exec() with |
match_limit. You can override the default by supplying pcre_exec() with |
| 1909 |
a pcre_extra block in which match_limit_recursion is set, and |
a pcre_extra block in which match_limit_recursion is set, and |
| 1910 |
PCRE_EXTRA_MATCH_LIMIT_RECURSION is set in the flags field. If the |
PCRE_EXTRA_MATCH_LIMIT_RECURSION is set in the flags field. If the |
| 1911 |
limit is exceeded, pcre_exec() returns PCRE_ERROR_RECURSIONLIMIT. |
limit is exceeded, pcre_exec() returns PCRE_ERROR_RECURSIONLIMIT. |
| 1912 |
|
|
| 1913 |
The pcre_callout field is used in conjunction with the "callout" fea- |
The pcre_callout field is used in conjunction with the "callout" fea- |
| 1914 |
ture, which is described in the pcrecallout documentation. |
ture, which is described in the pcrecallout documentation. |
| 1915 |
|
|
| 1916 |
The tables field is used to pass a character tables pointer to |
The tables field is used to pass a character tables pointer to |
| 1917 |
pcre_exec(); this overrides the value that is stored with the compiled |
pcre_exec(); this overrides the value that is stored with the compiled |
| 1918 |
pattern. A non-NULL value is stored with the compiled pattern only if |
pattern. A non-NULL value is stored with the compiled pattern only if |
| 1919 |
custom tables were supplied to pcre_compile() via its tableptr argu- |
custom tables were supplied to pcre_compile() via its tableptr argu- |
| 1920 |
ment. If NULL is passed to pcre_exec() using this mechanism, it forces |
ment. If NULL is passed to pcre_exec() using this mechanism, it forces |
| 1921 |
PCRE's internal tables to be used. This facility is helpful when re- |
PCRE's internal tables to be used. This facility is helpful when re- |
| 1922 |
using patterns that have been saved after compiling with an external |
using patterns that have been saved after compiling with an external |
| 1923 |
set of tables, because the external tables might be at a different |
set of tables, because the external tables might be at a different |
| 1924 |
address when pcre_exec() is called. See the pcreprecompile documenta- |
address when pcre_exec() is called. See the pcreprecompile documenta- |
| 1925 |
tion for a discussion of saving compiled patterns for later use. |
tion for a discussion of saving compiled patterns for later use. |
| 1926 |
|
|
| 1927 |
Option bits for pcre_exec() |
Option bits for pcre_exec() |
| 1928 |
|
|
| 1929 |
The unused bits of the options argument for pcre_exec() must be zero. |
The unused bits of the options argument for pcre_exec() must be zero. |
| 1930 |
The only bits that may be set are PCRE_ANCHORED, PCRE_NEWLINE_xxx, |
The only bits that may be set are PCRE_ANCHORED, PCRE_NEWLINE_xxx, |
| 1931 |
PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, PCRE_NO_START_OPTIMIZE, |
PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, PCRE_NO_START_OPTIMIZE, |
| 1932 |
PCRE_NO_UTF8_CHECK and PCRE_PARTIAL. |
PCRE_NO_UTF8_CHECK and PCRE_PARTIAL. |
| 1933 |
|
|
| 1934 |
PCRE_ANCHORED |
PCRE_ANCHORED |
| 1935 |
|
|
| 1936 |
The PCRE_ANCHORED option limits pcre_exec() to matching at the first |
The PCRE_ANCHORED option limits pcre_exec() to matching at the first |
| 1937 |
matching position. If a pattern was compiled with PCRE_ANCHORED, or |
matching position. If a pattern was compiled with PCRE_ANCHORED, or |
| 1938 |
turned out to be anchored by virtue of its contents, it cannot be made |
turned out to be anchored by virtue of its contents, it cannot be made |
| 1939 |
unanchored at matching time. |
unanchored at matching time. |
| 1940 |
|
|
| 1941 |
PCRE_BSR_ANYCRLF |
PCRE_BSR_ANYCRLF |
| 1942 |
PCRE_BSR_UNICODE |
PCRE_BSR_UNICODE |
| 1943 |
|
|
| 1944 |
These options (which are mutually exclusive) control what the \R escape |
These options (which are mutually exclusive) control what the \R escape |
| 1945 |
sequence matches. The choice is either to match only CR, LF, or CRLF, |
sequence matches. The choice is either to match only CR, LF, or CRLF, |
| 1946 |
or to match any Unicode newline sequence. These options override the |
or to match any Unicode newline sequence. These options override the |
| 1947 |
choice that was made or defaulted when the pattern was compiled. |
choice that was made or defaulted when the pattern was compiled. |
| 1948 |
|
|
| 1949 |
PCRE_NEWLINE_CR |
PCRE_NEWLINE_CR |
| 1952 |
PCRE_NEWLINE_ANYCRLF |
PCRE_NEWLINE_ANYCRLF |
| 1953 |
PCRE_NEWLINE_ANY |
PCRE_NEWLINE_ANY |
| 1954 |
|
|
| 1955 |
These options override the newline definition that was chosen or |
These options override the newline definition that was chosen or |
| 1956 |
defaulted when the pattern was compiled. For details, see the descrip- |
defaulted when the pattern was compiled. For details, see the descrip- |
| 1957 |
tion of pcre_compile() above. During matching, the newline choice |
tion of pcre_compile() above. During matching, the newline choice |
| 1958 |
affects the behaviour of the dot, circumflex, and dollar metacharac- |
affects the behaviour of the dot, circumflex, and dollar metacharac- |
| 1959 |
ters. It may also alter the way the match position is advanced after a |
ters. It may also alter the way the match position is advanced after a |
| 1960 |
match failure for an unanchored pattern. |
match failure for an unanchored pattern. |
| 1961 |
|
|
| 1962 |
When PCRE_NEWLINE_CRLF, PCRE_NEWLINE_ANYCRLF, or PCRE_NEWLINE_ANY is |
When PCRE_NEWLINE_CRLF, PCRE_NEWLINE_ANYCRLF, or PCRE_NEWLINE_ANY is |
| 1963 |
set, and a match attempt for an unanchored pattern fails when the cur- |
set, and a match attempt for an unanchored pattern fails when the cur- |
| 1964 |
rent position is at a CRLF sequence, and the pattern contains no |
rent position is at a CRLF sequence, and the pattern contains no |
| 1965 |
explicit matches for CR or LF characters, the match position is |
explicit matches for CR or LF characters, the match position is |
| 1966 |
advanced by two characters instead of one, in other words, to after the |
advanced by two characters instead of one, in other words, to after the |
| 1967 |
CRLF. |
CRLF. |
| 1968 |
|
|
| 1969 |
The above rule is a compromise that makes the most common cases work as |
The above rule is a compromise that makes the most common cases work as |
| 1970 |
expected. For example, if the pattern is .+A (and the PCRE_DOTALL |
expected. For example, if the pattern is .+A (and the PCRE_DOTALL |
| 1971 |
option is not set), it does not match the string "\r\nA" because, after |
option is not set), it does not match the string "\r\nA" because, after |
| 1972 |
failing at the start, it skips both the CR and the LF before retrying. |
failing at the start, it skips both the CR and the LF before retrying. |
| 1973 |
However, the pattern [\r\n]A does match that string, because it con- |
However, the pattern [\r\n]A does match that string, because it con- |
| 1974 |
tains an explicit CR or LF reference, and so advances only by one char- |
tains an explicit CR or LF reference, and so advances only by one char- |
| 1975 |
acter after the first failure. |
acter after the first failure. |
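
The advancing rule described above can be sketched as a small C
function. This is a simplified model for illustration, not PCRE's
implementation: the name next_start_offset and its two flag parameters
are hypothetical, and the sketch assumes byte mode, where one character
is one byte.

```c
#include <stddef.h>

/*
 * Return the next starting offset after a failed match attempt at pos.
 *
 *   crlf_is_newline            non-zero when CRLF is a valid newline
 *                              (CRLF, ANYCRLF or ANY newline setting)
 *   pattern_has_explicit_crlf  non-zero when the pattern contains an
 *                              explicit match for CR or LF
 */
static size_t next_start_offset(const char *subject, size_t length,
                                size_t pos, int crlf_is_newline,
                                int pattern_has_explicit_crlf)
{
    /* Skip a whole CRLF pair only when the rule applies. */
    if (crlf_is_newline && !pattern_has_explicit_crlf &&
        pos + 1 < length &&
        subject[pos] == '\r' && subject[pos + 1] == '\n')
        return pos + 2;
    /* Otherwise advance by a single character. */
    return pos + 1;
}
```

For the subject "\r\nA", the model advances from offset 0 to offset 2
for a pattern such as .+A, and to offset 1 for a pattern such as
[\r\n]A, which matches the two cases in the example above.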
| 1976 |
|
|
| 1977 |
An explicit match for CR or LF is either a literal appearance of one of |
An explicit match for CR or LF is either a literal appearance of one of |
| 1978 |
those characters, or one of the \r or \n escape sequences. Implicit |
those characters, or one of the \r or \n escape sequences. Implicit |
| 1979 |
matches such as [^X] do not count, nor does \s (which includes CR and |
matches such as [^X] do not count, nor does \s (which includes CR and |
| 1980 |
LF in the characters that it matches). |
LF in the characters that it matches). |
| 1981 |
|
|
| 1982 |
Notwithstanding the above, anomalous effects may still occur when CRLF |
Notwithstanding the above, anomalous effects may still occur when CRLF |
| 1983 |
is a valid newline sequence and explicit \r or \n escapes appear in the |
is a valid newline sequence and explicit \r or \n escapes appear in the |
| 1984 |
pattern. |
pattern. |
| 1985 |
|
|
| 1986 |
PCRE_NOTBOL |
PCRE_NOTBOL |
| 1987 |
|
|
| 1988 |
This option specifies that the first character of the subject string is not |
This option specifies that the first character of the subject string is not |
| 1989 |
the beginning of a line, so the circumflex metacharacter should not |
the beginning of a line, so the circumflex metacharacter should not |
| 1990 |
match before it. Setting this without PCRE_MULTILINE (at compile time) |
match before it. Setting this without PCRE_MULTILINE (at compile time) |
| 1991 |
causes circumflex never to match. This option affects only the behav- |
causes circumflex never to match. This option affects only the behav- |
| 1992 |
iour of the circumflex metacharacter. It does not affect \A. |
iour of the circumflex metacharacter. It does not affect \A. |
| 1993 |
|
|
| 1994 |
PCRE_NOTEOL |
PCRE_NOTEOL |
| 1995 |
|
|
| 1996 |
This option specifies that the end of the subject string is not the end |
This option specifies that the end of the subject string is not the end |
| 1997 |
of a line, so the dollar metacharacter should not match it nor (except |
of a line, so the dollar metacharacter should not match it nor (except |
| 1998 |
in multiline mode) a newline immediately before it. Setting this with- |
in multiline mode) a newline immediately before it. Setting this with- |
| 1999 |
out PCRE_MULTILINE (at compile time) causes dollar never to match. This |
out PCRE_MULTILINE (at compile time) causes dollar never to match. This |
| 2000 |
option affects only the behaviour of the dollar metacharacter. It does |
option affects only the behaviour of the dollar metacharacter. It does |
| 2001 |
not affect \Z or \z. |
not affect \Z or \z. |
| 2002 |
|
|
| 2003 |
PCRE_NOTEMPTY |
PCRE_NOTEMPTY |
| 2004 |
|
|
| 2005 |
An empty string is not considered to be a valid match if this option is |
An empty string is not considered to be a valid match if this option is |
| 2006 |
set. If there are alternatives in the pattern, they are tried. If all |
set. If there are alternatives in the pattern, they are tried. If all |
| 2007 |
the alternatives match the empty string, the entire match fails. For |
the alternatives match the empty string, the entire match fails. For |
| 2008 |
example, if the pattern |
example, if the pattern |
| 2009 |
|
|
| 2010 |
a?b? |
a?b? |
| 2011 |
|
|
| 2012 |
is applied to a string not beginning with "a" or "b", it matches the |
is applied to a string not beginning with "a" or "b", it matches the |
| 2013 |
empty string at the start of the subject. With PCRE_NOTEMPTY set, this |
empty string at the start of the subject. With PCRE_NOTEMPTY set, this |
| 2014 |
match is not valid, so PCRE searches further into the string for occur- |
match is not valid, so PCRE searches further into the string for occur- |
| 2015 |
rences of "a" or "b". |
rences of "a" or "b". |
| 2016 |
|
|
| 2017 |
Perl has no direct equivalent of PCRE_NOTEMPTY, but it does make a spe- |
Perl has no direct equivalent of PCRE_NOTEMPTY, but it does make a spe- |
| 2018 |
cial case of a pattern match of the empty string within its split() |
cial case of a pattern match of the empty string within its split() |
| 2019 |
function, and when using the /g modifier. It is possible to emulate |
function, and when using the /g modifier. It is possible to emulate |
| 2020 |
Perl's behaviour after matching a null string by first trying the match |
Perl's behaviour after matching a null string by first trying the match |
| 2021 |
again at the same offset with PCRE_NOTEMPTY and PCRE_ANCHORED, and then |
again at the same offset with PCRE_NOTEMPTY and PCRE_ANCHORED, and then |
| 2022 |
if that fails by advancing the starting offset (see below) and trying |
if that fails by advancing the starting offset (see below) and trying |
| 2023 |
an ordinary match again. There is some code that demonstrates how to do |
an ordinary match again. There is some code that demonstrates how to do |
| 2024 |
this in the pcredemo.c sample program. |
this in the pcredemo.c sample program. |
| 2025 |
|
|
| 2026 |
PCRE_NO_START_OPTIMIZE |
PCRE_NO_START_OPTIMIZE |
| 2027 |
|
|
| 2028 |
There are a number of optimizations that pcre_exec() uses at the start |
There are a number of optimizations that pcre_exec() uses at the start |
| 2029 |
of a match, in order to speed up the process. For example, if it is |
of a match, in order to speed up the process. For example, if it is |
| 2030 |
known that a match must start with a specific character, it searches |
known that a match must start with a specific character, it searches |
| 2031 |
the subject for that character, and fails immediately if it cannot find |
the subject for that character, and fails immediately if it cannot find |
| 2032 |
it, without actually running the main matching function. When callouts |
it, without actually running the main matching function. When callouts |
| 2033 |
are in use, these optimizations can cause them to be skipped. This |
are in use, these optimizations can cause them to be skipped. This |
| 2034 |
option disables the "start-up" optimizations, causing performance to |
option disables the "start-up" optimizations, causing performance to |
| 2035 |
suffer, but ensuring that the callouts do occur. |
suffer, but ensuring that the callouts do occur. |

PCRE_NO_UTF8_CHECK

When PCRE_UTF8 is set at compile time, the validity of the subject as a
UTF-8 string is automatically checked when pcre_exec() is subsequently
called. The value of startoffset is also checked to ensure that it
points to the start of a UTF-8 character. There is a discussion about
the validity of UTF-8 strings in the section on UTF-8 support in the
main pcre page. If an invalid UTF-8 sequence of bytes is found,
pcre_exec() returns the error PCRE_ERROR_BADUTF8. If startoffset con-
tains an invalid value, PCRE_ERROR_BADUTF8_OFFSET is returned.

If you already know that your subject is valid, and you want to skip
these checks for performance reasons, you can set the
PCRE_NO_UTF8_CHECK option when calling pcre_exec(). You might want to
do this for the second and subsequent calls to pcre_exec() if you are
making repeated calls to find all the matches in a single subject
string. However, you should be sure that the value of startoffset
points to the start of a UTF-8 character. When PCRE_NO_UTF8_CHECK is
set, the effect of passing an invalid UTF-8 string as a subject, or a
value of startoffset that does not point to the start of a UTF-8 char-
acter, is undefined. Your program may crash.
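
As a rough illustration of the startoffset requirement, the following C
sketch tests whether a byte offset can be the start of a UTF-8 character
by rejecting continuation bytes (those of the form 10xxxxxx). The name
offset_is_utf8_char_start() is hypothetical and is not part of PCRE, and
the test covers the offset only; it does not validate the subject as a
whole.

```c
#include <stddef.h>

/* Returns 1 if `offset` is a position at which a UTF-8 character may
   start in `subject` (which is `length` bytes long), 0 otherwise.
   A byte of the form 10xxxxxx is a continuation byte and therefore
   cannot be the first byte of a character. */
int offset_is_utf8_char_start(const char *subject, int length, int offset)
{
    if (subject == NULL || offset < 0 || offset > length)
        return 0;
    if (offset == length)      /* end of subject: no character is split */
        return 1;
    return (((unsigned char)subject[offset]) & 0xC0) != 0x80;
}
```

A caller that sets PCRE_NO_UTF8_CHECK might apply a check of this kind to
any starting offset that does not come directly from a previous
pcre_exec() result.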

PCRE_PARTIAL

This option turns on the partial matching feature. If the subject
string fails to match the pattern, but at some point during the match-
ing process the end of the subject was reached (that is, the subject
partially matches the pattern and the failure to match occurred only
because there were not enough subject characters), pcre_exec() returns
PCRE_ERROR_PARTIAL instead of PCRE_ERROR_NOMATCH. When PCRE_PARTIAL is
used, there are restrictions on what may appear in the pattern. These
are discussed in the pcrepartial documentation.

The string to be matched by pcre_exec()

The subject string is passed to pcre_exec() as a pointer in subject, a
length (in bytes) in length, and a starting byte offset in startoffset.
In UTF-8 mode, the byte offset must point to the start of a UTF-8 char-
acter. Unlike the pattern string, the subject may contain binary zero
bytes. When the starting offset is zero, the search for a match starts
at the beginning of the subject, and this is by far the most common
case.

A non-zero starting offset is useful when searching for another match
in the same subject by calling pcre_exec() again after a previous suc-
cess. Setting startoffset differs from just passing over a shortened
string and setting PCRE_NOTBOL in the case of a pattern that begins
with any kind of lookbehind. For example, consider the pattern

\Biss\B

which finds occurrences of "iss" in the middle of words. (\B matches
only if the current position in the subject is not a word boundary.)
When applied to the string "Mississippi" the first call to pcre_exec()
finds the first occurrence. If pcre_exec() is called again with just
the remainder of the subject, namely "issippi", it does not match,
because \B is always false at the start of the subject, which is deemed
to be a word boundary. However, if pcre_exec() is passed the entire
string again, but with startoffset set to 4, it finds the second occur-
rence of "iss" because it is able to look behind the starting point to
discover that it is preceded by a letter.
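
The role of the characters that precede the starting offset can be
modelled with a simplified, ASCII-only word-boundary test. This is only
a sketch of the idea behind \B, not PCRE's implementation, and both
function names are hypothetical.

```c
#include <ctype.h>

/* ASCII-only stand-in for a "word" character (letter, digit or
   underscore). This is an assumption made for the illustration. */
static int is_word_char(unsigned char c)
{
    return isalnum(c) || c == '_';
}

/* Returns 1 if position `pos` in subject[0..length) is a word boundary,
   that is, if the characters on either side differ in "wordness".
   Anything outside the subject counts as a non-word character, so
   position 0 of a subject that begins with a letter is a boundary. */
int is_word_boundary(const char *subject, int length, int pos)
{
    int before = pos > 0 && is_word_char((unsigned char)subject[pos - 1]);
    int after = pos < length && is_word_char((unsigned char)subject[pos]);
    return before != after;
}
```

In this model, position 0 of "issippi" is a boundary, so \B cannot match
there, whereas position 4 of "Mississippi" is not a boundary, because the
"s" before it is still visible to the test.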

If a non-zero starting offset is passed when the pattern is anchored,
one attempt to match at the given offset is made. This can only succeed
if the pattern does not require the match to be at the start of the
subject.

How pcre_exec() returns captured substrings

In general, a pattern matches a certain portion of the subject, and in
addition, further substrings from the subject may be picked out by
parts of the pattern. Following the usage in Jeffrey Friedl's book,
this is called "capturing" in what follows, and the phrase "capturing
subpattern" is used for a fragment of a pattern that picks out a sub-
string. PCRE supports several other kinds of parenthesized subpattern
that do not cause substrings to be captured.

Captured substrings are returned to the caller via a vector of integers
whose address is passed in ovector. The number of elements in the vec-
tor is passed in ovecsize, which must be a non-negative number. Note:
this argument is NOT the size of ovector in bytes.

The first two-thirds of the vector is used to pass back captured sub-
strings, each substring using a pair of integers. The remaining third
of the vector is used as workspace by pcre_exec() while matching cap-
turing subpatterns, and is not available for passing back information.
The number passed in ovecsize should always be a multiple of three. If
it is not, it is rounded down.

When a match is successful, information about captured substrings is
returned in pairs of integers, starting at the beginning of ovector,
and continuing up to two-thirds of its length at the most. The first
element of each pair is set to the byte offset of the first character
in a substring, and the second is set to the byte offset of the first
character after the end of a substring. Note: these values are always
byte offsets, even in UTF-8 mode. They are not character counts.

The first pair of integers, ovector[0] and ovector[1], identifies the
portion of the subject string matched by the entire pattern. The next
pair is used for the first capturing subpattern, and so on. The value
returned by pcre_exec() is one more than the highest numbered pair that
has been set. For example, if two substrings have been captured, the
returned value is 3. If there are no capturing subpatterns, the return
value from a successful match is 1, indicating that just the first pair
of offsets has been set.
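
The pair layout can be illustrated with a small helper that copies one
substring out of the subject by its byte offsets. This is a minimal
sketch: copy_pair() is a hypothetical name, and the helper shows only how
the offsets are read; it is not a substitute for the library's own
extraction functions.

```c
#include <string.h>

/* Copies the text of pair `n` (0 = whole match, 1 = first capturing
   subpattern, and so on) from `subject` into `buf` and zero-terminates
   it. `rc` is the positive value that pcre_exec() returned.
   Returns the length in bytes, or -1 if pair n was not set or the
   buffer is too small. */
int copy_pair(const char *subject, const int *ovector, int rc,
              int n, char *buf, int bufsize)
{
    int start, end, len;

    if (n < 0 || n >= rc)
        return -1;               /* beyond the highest pair that was set */
    start = ovector[2 * n];      /* byte offset of the first character */
    end = ovector[2 * n + 1];    /* byte offset just past the last one */
    if (start < 0 || end < start)
        return -1;               /* unset pair */
    len = end - start;
    if (len + 1 > bufsize)
        return -1;               /* no room for text plus terminator */
    memcpy(buf, subject + start, (size_t)len);
    buf[len] = '\0';
    return len;
}
```

For a subject "key=value" and an offset vector beginning 0, 9, 0, 3, 4, 9
with a return value of 3, pair 0 yields "key=value", pair 1 yields "key"
and pair 2 yields "value".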

If a capturing subpattern is matched repeatedly, it is the last portion
of the string that it matched that is returned.

If the vector is too small to hold all the captured substring offsets,
it is used as far as possible (up to two-thirds of its length), and the
function returns a value of zero. If the substring offsets are not of
interest, pcre_exec() may be called with ovector passed as NULL and
ovecsize as zero. However, if the pattern contains back references and
the ovector is not big enough to remember the related substrings, PCRE
has to get additional memory for use during matching. Thus it is usu-
ally advisable to supply an ovector.

The pcre_info() function can be used to find out how many capturing
subpatterns there are in a compiled pattern. The smallest size for
ovector that will allow for n captured substrings, in addition to the
offsets of the substring matched by the whole pattern, is (n+1)*3.
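
The sizing arithmetic can be written down directly. The two helper names
below are hypothetical; the sketch only restates the (n+1)*3 rule and the
rounding down of ovecsize to a multiple of three.

```c
/* Smallest ovecsize that can return n captured substrings in addition
   to the whole-match pair: two elements per pair, plus a further third
   of the vector that pcre_exec() uses as workspace. */
int min_ovecsize(int n)
{
    return (n + 1) * 3;
}

/* Number of offset pairs that a vector of `ovecsize` elements can
   return. Integer division performs the rounding down to a multiple
   of three. */
int usable_pairs(int ovecsize)
{
    return ovecsize / 3;
}
```

For instance, a pattern with two capturing subpatterns needs at least
nine elements, and a vector of 31 elements returns no more pairs than a
vector of 30.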

It is possible for capturing subpattern number n+1 to match some part
of the subject when subpattern n has not been used at all. For example,
if the string "abc" is matched against the pattern (a|(z))(bc) the
return from the function is 4, and subpatterns 1 and 3 are matched, but
2 is not. When this happens, both values in the offset pairs corre-
sponding to unused subpatterns are set to -1.
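
A caller can test for the -1 marker before using a pair. The following is
a minimal sketch; capture_is_set() is a hypothetical helper, and the
caller is assumed to have checked that the vector has room for pair n.

```c
/* Returns 1 if capturing subpattern `n` took part in the match, or 0 if
   its offset pair holds the "unset" marker (both values -1).
   Pair 0 is the whole match. */
int capture_is_set(const int *ovector, int n)
{
    return ovector[2 * n] >= 0 && ovector[2 * n + 1] >= 0;
}
```

With an offset vector written out by hand to mirror the "abc" example,
namely 0, 3, 0, 1, -1, -1, 1, 3, the helper reports pairs 0, 1 and 3 as
set and pair 2 as unset.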

Offset values that correspond to unused subpatterns at the end of the
expression are also set to -1. For example, if the string "abc" is
matched against the pattern (abc)(x(yz)?)? subpatterns 2 and 3 are not
matched. The return from the function is 2, because the highest used
capturing subpattern number is 1. However, you can refer to the offsets
for the second and third capturing subpatterns if you wish (assuming
the vector is large enough, of course).

Some convenience functions are provided for extracting the captured
substrings as separate strings. These are described below.

Error return values from pcre_exec()

If pcre_exec() fails, it returns a negative number. The following are
defined in the header file:

PCRE_ERROR_NOMATCH (-1)

PCRE_ERROR_NULL (-2)

Either code or subject was passed as NULL, or ovector was NULL and
ovecsize was not zero.

PCRE_ERROR_BADOPTION (-3)

PCRE_ERROR_BADMAGIC (-4)

PCRE stores a 4-byte "magic number" at the start of the compiled code,
to catch the case when it is passed a junk pointer and to detect when a
pattern that was compiled in an environment of one endianness is run in
an environment with the other endianness. This is the error that PCRE
gives when the magic number is not present.

PCRE_ERROR_UNKNOWN_OPCODE (-5)

While running the pattern match, an unknown item was encountered in the
compiled pattern. This error could be caused by a bug in PCRE or by
overwriting of the compiled pattern.

PCRE_ERROR_NOMEMORY (-6)

If a pattern contains back references, but the ovector that is passed
to pcre_exec() is not big enough to remember the referenced substrings,
PCRE gets a block of memory at the start of matching to use for this
purpose. If the call via pcre_malloc() fails, this error is given. The
memory is automatically freed at the end of matching.

PCRE_ERROR_NOSUBSTRING (-7)

This error is used by the pcre_copy_substring(), pcre_get_substring(),
and pcre_get_substring_list() functions (see below). It is never
returned by pcre_exec().

PCRE_ERROR_MATCHLIMIT (-8)

The backtracking limit, as specified by the match_limit field in a
pcre_extra structure (or defaulted) was reached. See the description
above.

PCRE_ERROR_CALLOUT (-9)

This error is never generated by pcre_exec() itself. It is provided for
use by callout functions that want to yield a distinctive error code.
See the pcrecallout documentation for details.

PCRE_ERROR_BADUTF8 (-10)

A string that contains an invalid UTF-8 byte sequence was passed as a
subject.

PCRE_ERROR_BADUTF8_OFFSET (-11)

The UTF-8 byte sequence that was passed as a subject was valid, but the
value of startoffset did not point to the beginning of a UTF-8 charac-
ter.

PCRE_ERROR_PARTIAL (-12)

The subject string did not match, but it did match partially. See the
pcrepartial documentation for details of partial matching.

PCRE_ERROR_BADPARTIAL (-13)

The PCRE_PARTIAL option was used with a compiled pattern containing
items that are not supported for partial matching. See the pcrepartial
documentation for details of partial matching.

PCRE_ERROR_INTERNAL (-14)

An unexpected internal error has occurred. This error could be caused
by a bug in PCRE or by overwriting of the compiled pattern.

PCRE_ERROR_BADCOUNT (-15)

PCRE_ERROR_RECURSIONLIMIT (-21)

The internal recursion limit, as specified by the match_limit_recursion
field in a pcre_extra structure (or defaulted) was reached. See the
description above.

PCRE_ERROR_BADNEWLINE (-23)
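
One possible way to report these codes is a small lookup function. The
sketch below hard-codes the numeric values listed in this section so that
it compiles without pcre.h; a real program would normally use the
symbolic names from the header instead. Both function names are
hypothetical, and treating NOMATCH and PARTIAL as ordinary outcomes
rather than errors is a design choice of the sketch.

```c
#include <stddef.h>

/* Maps a negative pcre_exec() return value to the name used in this
   section. Returns NULL for values that are not listed here. */
const char *exec_error_name(int rc)
{
    switch (rc) {
    case -1:  return "PCRE_ERROR_NOMATCH";
    case -2:  return "PCRE_ERROR_NULL";
    case -3:  return "PCRE_ERROR_BADOPTION";
    case -4:  return "PCRE_ERROR_BADMAGIC";
    case -5:  return "PCRE_ERROR_UNKNOWN_OPCODE";
    case -6:  return "PCRE_ERROR_NOMEMORY";
    case -7:  return "PCRE_ERROR_NOSUBSTRING";
    case -8:  return "PCRE_ERROR_MATCHLIMIT";
    case -9:  return "PCRE_ERROR_CALLOUT";
    case -10: return "PCRE_ERROR_BADUTF8";
    case -11: return "PCRE_ERROR_BADUTF8_OFFSET";
    case -12: return "PCRE_ERROR_PARTIAL";
    case -13: return "PCRE_ERROR_BADPARTIAL";
    case -14: return "PCRE_ERROR_INTERNAL";
    case -15: return "PCRE_ERROR_BADCOUNT";
    case -21: return "PCRE_ERROR_RECURSIONLIMIT";
    case -23: return "PCRE_ERROR_BADNEWLINE";
    default:  return NULL;
    }
}

/* Returns 1 for negative results other than "no match" (-1) and
   "partial match" (-12), which callers often handle as normal
   outcomes rather than as failures. */
int exec_result_is_hard_error(int rc)
{
    return rc < 0 && rc != -1 && rc != -12;
}
```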

int pcre_get_substring_list(const char *subject,
int *ovector, int stringcount, const char ***listptr);

Captured substrings can be accessed directly by using the offsets
returned by pcre_exec() in ovector. For convenience, the functions
pcre_copy_substring(), pcre_get_substring(), and pcre_get_sub-
string_list() are provided for extracting captured substrings as new,
separate, zero-terminated strings. These functions identify substrings
by number. The next section describes functions for extracting named
substrings.

A substring that contains a binary zero is correctly extracted and has
a further zero added on the end, but the result is not, of course, a C
string. However, you can process such a string by referring to the
length that is returned by pcre_copy_substring() and pcre_get_sub-
string(). Unfortunately, the interface to pcre_get_substring_list() is
not adequate for handling strings containing binary zeros, because the
end of the final string is not independently indicated.

The first three arguments are the same for all three of these func-
tions: subject is the subject string that has just been successfully
matched, ovector is a pointer to the vector of integer offsets that was
passed to pcre_exec(), and stringcount is the number of substrings that
were captured by the match, including the substring that matched the
entire regular expression. This is the value returned by pcre_exec() if
it is greater than zero. If pcre_exec() returned zero, indicating that
it ran out of space in ovector, the value passed as stringcount should
be the number of elements in the vector divided by three.
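
The rule for choosing stringcount can be captured in a few lines. This is
a minimal sketch with a hypothetical function name; returning 0 for a
failed match is an assumption of the sketch, meaning that there is
nothing to extract.

```c
/* Chooses the stringcount argument for the substring extraction
   functions from the value that pcre_exec() returned (`rc`) and the
   ovecsize that was passed to it. */
int effective_stringcount(int rc, int ovecsize)
{
    if (rc > 0)
        return rc;              /* every pair that was set fitted in ovector */
    if (rc == 0)
        return ovecsize / 3;    /* ovector filled up: use all pairs it holds */
    return 0;                   /* error or no match: nothing to extract */
}
```

For example, a return value of zero with an ovecsize of 30 gives a
stringcount of 10.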

The functions pcre_copy_substring() and pcre_get_substring() extract a
single substring, whose number is given as stringnumber. A value of
zero extracts the substring that matched the entire pattern, whereas
higher values extract the captured substrings. For pcre_copy_sub-
string(), the string is placed in buffer, whose length is given by
buffersize, while for pcre_get_substring() a new block of memory is
obtained via pcre_malloc, and its address is returned via stringptr.
The yield of the function is the length of the string, not including
the terminating zero, or one of these error codes:

PCRE_ERROR_NOMEMORY (-6)

The buffer was too small for pcre_copy_substring(), or the attempt to
get memory failed for pcre_get_substring().

PCRE_ERROR_NOSUBSTRING (-7)

There is no substring whose number is stringnumber.

The pcre_get_substring_list() function extracts all available sub-
strings and builds a list of pointers to them. All this is done in a
single block of memory that is obtained via pcre_malloc. The address of
the memory block is returned via listptr, which is also the start of
the list of string pointers. The end of the list is marked by a NULL
pointer. The yield of the function is zero if all went well, or the
error code

PCRE_ERROR_NOMEMORY (-6)

if the attempt to get the memory block failed.

When any of these functions encounters a substring that is unset, which
can happen when capturing subpattern number n+1 matches some part of
the subject, but subpattern n has not been used at all, they return an
empty string. This can be distinguished from a genuine zero-length sub-
string by inspecting the appropriate offset in ovector, which is nega-
tive for unset substrings.

The two convenience functions pcre_free_substring() and pcre_free_sub-
string_list() can be used to free the memory returned by a previous
call of pcre_get_substring() or pcre_get_substring_list(), respec-
tively. They do nothing more than call the function pointed to by
pcre_free, which of course could be called directly from a C program.
However, PCRE is used in some situations where it is linked via a spe-
cial interface to another programming language that cannot use
pcre_free directly; it is for these cases that the functions are pro-
vided.
| 2366 |
|
|
| 2367 |
|
|
| 2380 |
int stringcount, const char *stringname, |
int stringcount, const char *stringname, |
| 2381 |
const char **stringptr); |
const char **stringptr); |
| 2382 |
|
|
| 2383 |
To extract a substring by name, you first have to find the associated |
To extract a substring by name, you first have to find the associated |
| 2384 |
number. For example, for this pattern |
number. For example, for this pattern |
| 2385 |
|
|
| 2386 |
(a+)b(?<xxx>\d+)... |
(a+)b(?<xxx>\d+)... |
| 2389 |
be unique (PCRE_DUPNAMES was not set), you can find the number from the |
be unique (PCRE_DUPNAMES was not set), you can find the number from the |
| 2390 |
name by calling pcre_get_stringnumber(). The first argument is the com- |
name by calling pcre_get_stringnumber(). The first argument is the com- |
| 2391 |
piled pattern, and the second is the name. The yield of the function is |
piled pattern, and the second is the name. The yield of the function is |
| 2392 |
the subpattern number, or PCRE_ERROR_NOSUBSTRING (-7) if there is no |
the subpattern number, or PCRE_ERROR_NOSUBSTRING (-7) if there is no |
| 2393 |
subpattern of that name. |
subpattern of that name. |
| 2394 |
|
|
| 2395 |
Given the number, you can extract the substring directly, or use one of |
Given the number, you can extract the substring directly, or use one of |
| 2396 |
the functions described in the previous section. For convenience, there |
the functions described in the previous section. For convenience, there |
| 2397 |
are also two functions that do the whole job. |
are also two functions that do the whole job. |
| 2398 |
|
|
| 2399 |
Most of the arguments of pcre_copy_named_substring() and |
Most of the arguments of pcre_copy_named_substring() and |
| 2400 |
pcre_get_named_substring() are the same as those for the similarly |
pcre_get_named_substring() are the same as those for the similarly |
| 2401 |
named functions that extract by number. As these are described in the |
named functions that extract by number. As these are described in the |
| 2402 |
previous section, they are not re-described here. There are just two |
previous section, they are not re-described here. There are just two |
| 2403 |
differences: |
differences: |
| 2404 |
|
|
| 2405 |
First, instead of a substring number, a substring name is given. Sec- |
First, instead of a substring number, a substring name is given. Sec- |
| 2406 |
ond, there is an extra argument, given at the start, which is a pointer |
ond, there is an extra argument, given at the start, which is a pointer |
| 2407 |
to the compiled pattern. This is needed in order to gain access to the |
to the compiled pattern. This is needed in order to gain access to the |
| 2408 |
name-to-number translation table. |
name-to-number translation table. |
| 2409 |
|
|
| 2410 |
These functions call pcre_get_stringnumber(), and if it succeeds, they |
These functions call pcre_get_stringnumber(), and if it succeeds, they |
| 2411 |
then call pcre_copy_substring() or pcre_get_substring(), as appropri- |
then call pcre_copy_substring() or pcre_get_substring(), as appropri- |
| 2412 |
ate. NOTE: If PCRE_DUPNAMES is set and there are duplicate names, the |
ate. NOTE: If PCRE_DUPNAMES is set and there are duplicate names, the |
| 2413 |
behaviour may not be what you want (see the next section). |
behaviour may not be what you want (see the next section). |
| 2414 |
|
|
| 2415 |
Warning: If the pattern uses the "(?|" feature to set up multiple sub- |
Warning: If the pattern uses the "(?|" feature to set up multiple sub- |
| 2416 |
patterns with the same number, you cannot use names to distinguish |
patterns with the same number, you cannot use names to distinguish |
| 2417 |
them, because names are not included in the compiled code. The matching |
them, because names are not included in the compiled code. The matching |
| 2418 |
process uses only numbers. |
process uses only numbers. |
| 2419 |
|
|
| 2423 |
int pcre_get_stringtable_entries(const pcre *code, |
int pcre_get_stringtable_entries(const pcre *code, |
| 2424 |
const char *name, char **first, char **last); |
const char *name, char **first, char **last); |
| 2425 |
|
|
| 2426 |
When a pattern is compiled with the PCRE_DUPNAMES option, names for |
When a pattern is compiled with the PCRE_DUPNAMES option, names for |
| 2427 |
subpatterns are not required to be unique. Normally, patterns with |
subpatterns are not required to be unique. Normally, patterns with |
| 2428 |
duplicate names are such that in any one match, only one of the named |
duplicate names are such that in any one match, only one of the named |
| 2429 |
subpatterns participates. An example is shown in the pcrepattern docu- |
subpatterns participates. An example is shown in the pcrepattern docu- |
| 2430 |
mentation. |
mentation. |
| 2431 |
|
|
| 2432 |
When duplicates are present, pcre_copy_named_substring() and |
When duplicates are present, pcre_copy_named_substring() and |
| 2433 |
pcre_get_named_substring() return the first substring corresponding to |
pcre_get_named_substring() return the first substring corresponding to |
| 2434 |
the given name that is set. If none are set, PCRE_ERROR_NOSUBSTRING |
the given name that is set. If none are set, PCRE_ERROR_NOSUBSTRING |
| 2435 |
(-7) is returned; no data is returned. The pcre_get_stringnumber() |
(-7) is returned; no data is returned. The pcre_get_stringnumber() |
| 2436 |
function returns one of the numbers that are associated with the name, |
function returns one of the numbers that are associated with the name, |
| 2437 |
but it is not defined which it is. |
but it is not defined which it is. |
| 2438 |
|
|
| 2439 |
If you want to get full details of all captured substrings for a given |
If you want to get full details of all captured substrings for a given |
| 2440 |
name, you must use the pcre_get_stringtable_entries() function. The |
name, you must use the pcre_get_stringtable_entries() function. The |
| 2441 |
first argument is the compiled pattern, and the second is the name. The |
first argument is the compiled pattern, and the second is the name. The |
| 2442 |
third and fourth are pointers to variables which are updated by the |
third and fourth are pointers to variables which are updated by the |
| 2443 |
function. After it has run, they point to the first and last entries in |
function. After it has run, they point to the first and last entries in |
| 2444 |
the name-to-number table for the given name. The function itself |
the name-to-number table for the given name. The function itself |
| 2445 |
returns the length of each entry, or PCRE_ERROR_NOSUBSTRING (-7) if |
returns the length of each entry, or PCRE_ERROR_NOSUBSTRING (-7) if |
| 2446 |
there are none. The format of the table is described above in the sec- |
there are none. The format of the table is described above in the sec- |
| 2447 |
tion entitled Information about a pattern. Given all the relevant |
tion entitled Information about a pattern. Given all the relevant |
| 2448 |
entries for the name, you can extract each of their numbers, and hence |
entries for the name, you can extract each of their numbers, and hence |
| 2449 |
the captured data, if any. |
the captured data, if any. |
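As a rough sketch of that last step, the following C fragment walks the entries between first and last and picks the first group that is set. It relies on the entry layout as described in the "Information about a pattern" section for the 8-bit library, namely a two-byte group number, most significant byte first, followed by the zero-terminated name. The helper names entry_group_number and first_set_group are hypothetical, not PCRE functions.

```c
#include <assert.h>
#include <stddef.h>

/* Decode the group number held in the first two bytes of one
   name-table entry (most significant byte first). */
static int entry_group_number(const char *entry)
{
    const unsigned char *p = (const unsigned char *)entry;
    return (p[0] << 8) | p[1];
}

/* Hypothetical helper (not a PCRE function). first, last and entry_size
   are the values produced by pcre_get_stringtable_entries(). Returns the
   number of the first group with the given name that is set in ovector,
   or -1 if none is set. ovector must be big enough to hold a pair for
   every group number that appears in the table. */
static int first_set_group(const char *first, const char *last,
                           int entry_size, const int *ovector)
{
    assert(first != NULL && last != NULL && entry_size > 2);
    for (const char *entry = first; entry <= last; entry += entry_size) {
        int n = entry_group_number(entry);
        if (ovector[2 * n] >= 0)   /* negative offset means unset */
            return n;
    }
    return -1;
}
```

To collect every captured substring for the name rather than only the first, the same loop can record each group number whose offset is non-negative instead of returning early.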
| 2450 |
|
|
| 2451 |
|
|
| 2452 |
FINDING ALL POSSIBLE MATCHES |
FINDING ALL POSSIBLE MATCHES |
| 2453 |
|
|
| 2454 |
The traditional matching function uses a similar algorithm to Perl, |
The traditional matching function uses a similar algorithm to Perl, |
| 2455 |
which stops when it finds the first match, starting at a given point in |
which stops when it finds the first match, starting at a given point in |
| 2456 |
the subject. If you want to find all possible matches, or the longest |
the subject. If you want to find all possible matches, or the longest |
| 2457 |
possible match, consider using the alternative matching function (see |
possible match, consider using the alternative matching function (see |
| 2458 |
below) instead. If you cannot use the alternative function, but still |
below) instead. If you cannot use the alternative function, but still |
| 2459 |
need to find all possible matches, you can kludge it up by making use |
need to find all possible matches, you can kludge it up by making use |
| 2460 |
of the callout facility, which is described in the pcrecallout documen- |
of the callout facility, which is described in the pcrecallout documen- |
| 2461 |
tation. |
tation. |
| 2462 |
|
|
| 2463 |
What you have to do is to insert a callout right at the end of the pat- |
What you have to do is to insert a callout right at the end of the pat- |
| 2464 |
tern. When your callout function is called, extract and save the cur- |
tern. When your callout function is called, extract and save the cur- |
| 2465 |
rent matched substring. Then return 1, which forces pcre_exec() to |
rent matched substring. Then return 1, which forces pcre_exec() to |
| 2466 |
backtrack and try other alternatives. Ultimately, when it runs out of |
backtrack and try other alternatives. Ultimately, when it runs out of |
| 2467 |
matches, pcre_exec() will yield PCRE_ERROR_NOMATCH. |
matches, pcre_exec() will yield PCRE_ERROR_NOMATCH. |
| 2468 |
|
|
| 2469 |
|
|
| 2474 |
int options, int *ovector, int ovecsize, |
int options, int *ovector, int ovecsize, |
| 2475 |
int *workspace, int wscount); |
int *workspace, int wscount); |
| 2476 |
|
|
| 2477 |
The function pcre_dfa_exec() is called to match a subject string |
The function pcre_dfa_exec() is called to match a subject string |
| 2478 |
against a compiled pattern, using a matching algorithm that scans the |
against a compiled pattern, using a matching algorithm that scans the |
| 2479 |
subject string just once, and does not backtrack. This has different |
subject string just once, and does not backtrack. This has different |
| 2480 |
characteristics to the normal algorithm, and is not compatible with |
characteristics to the normal algorithm, and is not compatible with |
| 2481 |
Perl. Some of the features of PCRE patterns are not supported. Never- |
Perl. Some of the features of PCRE patterns are not supported. Never- |
| 2482 |
theless, there are times when this kind of matching can be useful. For |
theless, there are times when this kind of matching can be useful. For |
| 2483 |
a discussion of the two matching algorithms, see the pcrematching docu- |
a discussion of the two matching algorithms, see the pcrematching docu- |
| 2484 |
mentation. |
mentation. |
| 2485 |
|
|
| 2486 |
The arguments for the pcre_dfa_exec() function are the same as for |
The arguments for the pcre_dfa_exec() function are the same as for |
| 2487 |
pcre_exec(), plus two extras. The ovector argument is used in a differ- |
pcre_exec(), plus two extras. The ovector argument is used in a differ- |
| 2488 |
ent way, and this is described below. The other common arguments are |
ent way, and this is described below. The other common arguments are |
| 2489 |
used in the same way as for pcre_exec(), so their description is not |
used in the same way as for pcre_exec(), so their description is not |
| 2490 |
repeated here. |
repeated here. |
| 2491 |
|
|
| 2492 |
The two additional arguments provide workspace for the function. The |
The two additional arguments provide workspace for the function. The |
| 2493 |
workspace vector should contain at least 20 elements. It is used for |
workspace vector should contain at least 20 elements. It is used for |
| 2494 |
keeping track of multiple paths through the pattern tree. More |
keeping track of multiple paths through the pattern tree. More |
| 2495 |
workspace will be needed for patterns and subjects where there are a |
workspace will be needed for patterns and subjects where there are a |
| 2496 |
lot of potential matches. |
lot of potential matches. |
| 2497 |
|
|
| 2498 |
Here is an example of a simple call to pcre_dfa_exec(): |
Here is an example of a simple call to pcre_dfa_exec(): |
| 2514 |
|
|
| 2515 |
Option bits for pcre_dfa_exec() |
Option bits for pcre_dfa_exec() |
| 2516 |
|
|
| 2517 |
The unused bits of the options argument for pcre_dfa_exec() must be |
The unused bits of the options argument for pcre_dfa_exec() must be |
| 2518 |
zero. The only bits that may be set are PCRE_ANCHORED, PCRE_NEW- |
zero. The only bits that may be set are PCRE_ANCHORED, PCRE_NEW- |
| 2519 |
LINE_xxx, PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, PCRE_NO_UTF8_CHECK, |
LINE_xxx, PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, PCRE_NO_UTF8_CHECK, |
| 2520 |
PCRE_PARTIAL, PCRE_DFA_SHORTEST, and PCRE_DFA_RESTART. All but the last |
PCRE_PARTIAL, PCRE_DFA_SHORTEST, and PCRE_DFA_RESTART. All but the last |
| 2521 |
three of these are the same as for pcre_exec(), so their description is |
three of these are the same as for pcre_exec(), so their description is |
| 2522 |
not repeated here. |
not repeated here. |
| 2523 |
|
|
| 2524 |
PCRE_PARTIAL |
PCRE_PARTIAL |
| 2525 |
|
|
| 2526 |
This has the same general effect as it does for pcre_exec(), but the |
This has the same general effect as it does for pcre_exec(), but the |
| 2527 |
details are slightly different. When PCRE_PARTIAL is set for |
details are slightly different. When PCRE_PARTIAL is set for |
| 2528 |
pcre_dfa_exec(), the return code PCRE_ERROR_NOMATCH is converted into |
pcre_dfa_exec(), the return code PCRE_ERROR_NOMATCH is converted into |
| 2529 |
PCRE_ERROR_PARTIAL if the end of the subject is reached, there have |
PCRE_ERROR_PARTIAL if the end of the subject is reached, there have |
| 2530 |
been no complete matches, but there is still at least one matching pos- |
been no complete matches, but there is still at least one matching pos- |
| 2531 |
sibility. The portion of the string that provided the partial match is |
sibility. The portion of the string that provided the partial match is |
| 2532 |
set as the first matching string. |
set as the first matching string. |
| 2533 |
|
|
| 2534 |
PCRE_DFA_SHORTEST |
PCRE_DFA_SHORTEST |
| 2535 |
|
|
| 2536 |
Setting the PCRE_DFA_SHORTEST option causes the matching algorithm to |
Setting the PCRE_DFA_SHORTEST option causes the matching algorithm to |
| 2537 |
stop as soon as it has found one match. Because of the way the alterna- |
stop as soon as it has found one match. Because of the way the alterna- |
| 2538 |
tive algorithm works, this is necessarily the shortest possible match |
tive algorithm works, this is necessarily the shortest possible match |
| 2539 |
at the first possible matching point in the subject string. |
at the first possible matching point in the subject string. |
| 2540 |
|
|
| 2541 |
PCRE_DFA_RESTART |
PCRE_DFA_RESTART |
| 2542 |
|
|
| 2543 |
When pcre_dfa_exec() is called with the PCRE_PARTIAL option, and |
When pcre_dfa_exec() is called with the PCRE_PARTIAL option, and |
| 2544 |
returns a partial match, it is possible to call it again, with addi- |
returns a partial match, it is possible to call it again, with addi- |
| 2545 |
tional subject characters, and have it continue with the same match. |
tional subject characters, and have it continue with the same match. |
| 2546 |
The PCRE_DFA_RESTART option requests this action; when it is set, the |
The PCRE_DFA_RESTART option requests this action; when it is set, the |
| 2547 |
workspace and wscount arguments must reference the same vector as before |
workspace and wscount arguments must reference the same vector as before |
| 2548 |
because data about the match so far is left in them after a partial |
because data about the match so far is left in them after a partial |
| 2549 |
match. There is more discussion of this facility in the pcrepartial |
match. There is more discussion of this facility in the pcrepartial |
| 2550 |
documentation. |
documentation. |
| 2551 |
|
|
| 2552 |
Successful returns from pcre_dfa_exec() |
Successful returns from pcre_dfa_exec() |
| 2553 |
|
|
| 2554 |
When pcre_dfa_exec() succeeds, it may have matched more than one sub- |
When pcre_dfa_exec() succeeds, it may have matched more than one sub- |
| 2555 |
string in the subject. Note, however, that all the matches from one run |
string in the subject. Note, however, that all the matches from one run |
| 2556 |
of the function start at the same point in the subject. The shorter |
of the function start at the same point in the subject. The shorter |
| 2557 |
matches are all initial substrings of the longer matches. For example, |
matches are all initial substrings of the longer matches. For example, |
| 2558 |
if the pattern |
if the pattern |
| 2559 |
|
|
| 2560 |
<.*> |
<.*> |
| 2569 |
<something> <something else> |
<something> <something else> |
| 2570 |
<something> <something else> <something further> |
<something> <something else> <something further> |
| 2571 |
|
|
| 2572 |
On success, the yield of the function is a number greater than zero, |
On success, the yield of the function is a number greater than zero, |
| 2573 |
which is the number of matched substrings. The substrings themselves |
which is the number of matched substrings. The substrings themselves |
| 2574 |
are returned in ovector. Each string uses two elements; the first is |
are returned in ovector. Each string uses two elements; the first is |
| 2575 |
the offset to the start, and the second is the offset to the end. In |
the offset to the start, and the second is the offset to the end. In |
| 2576 |
fact, all the strings have the same start offset. (Space could have |
fact, all the strings have the same start offset. (Space could have |
| 2577 |
been saved by giving this only once, but it was decided to retain some |
been saved by giving this only once, but it was decided to retain some |
| 2578 |
compatibility with the way pcre_exec() returns data, even though the |
compatibility with the way pcre_exec() returns data, even though the |
| 2579 |
meaning of the strings is different.) |
meaning of the strings is different.) |
| 2580 |
|
|
| 2581 |
The strings are returned in reverse order of length; that is, the long- |
The strings are returned in reverse order of length; that is, the long- |
| 2582 |
est matching string is given first. If there were too many matches to |
est matching string is given first. If there were too many matches to |
| 2583 |
fit into ovector, the yield of the function is zero, and the vector is |
fit into ovector, the yield of the function is zero, and the vector is |
| 2584 |
filled with the longest matches. |
filled with the longest matches. |
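The layout of the returned data can be illustrated with a small C sketch. The helper names dfa_match_count and print_dfa_matches are hypothetical and not part of PCRE. The sketch assumes that, when the yield is zero, the whole of ovector has been used as start/end pairs, so at most ovecsize/2 matches are available.

```c
#include <assert.h>
#include <stdio.h>

/* Number of (start, end) pairs to read from ovector after
   pcre_dfa_exec() has returned rc. Assumption: a yield of zero means
   the vector was filled, so ovecsize/2 pairs are present. */
static int dfa_match_count(int rc, int ovecsize)
{
    if (rc < 0)
        return 0;                  /* error, no match, or partial match */
    if (rc == 0)
        return ovecsize / 2;       /* too many matches for the vector */
    return rc;
}

/* Print the matches in the order they are stored, longest first.
   All of them start at the same offset in the subject. */
static void print_dfa_matches(const char *subject, const int *ovector,
                              int rc, int ovecsize)
{
    assert(subject != NULL && ovector != NULL);
    int count = dfa_match_count(rc, ovecsize);
    for (int i = 0; i < count; i++) {
        int start = ovector[2 * i];
        int end = ovector[2 * i + 1];
        printf("%d: %.*s\n", i, end - start, subject + start);
    }
}
```

Because every pair shares the same start offset, a caller that only needs the longest match can read the first pair and ignore the rest.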
| 2585 |
|
|
| 2586 |
Error returns from pcre_dfa_exec() |
Error returns from pcre_dfa_exec() |
| 2587 |
|
|
| 2588 |
The pcre_dfa_exec() function returns a negative number when it fails. |
The pcre_dfa_exec() function returns a negative number when it fails. |
| 2589 |
Many of the errors are the same as for pcre_exec(), and these are |
Many of the errors are the same as for pcre_exec(), and these are |
| 2590 |
described above. There are in addition the following errors that are |
described above. There are in addition the following errors that are |
| 2591 |
specific to pcre_dfa_exec(): |
specific to pcre_dfa_exec(): |
| 2592 |
|
|
| 2593 |
PCRE_ERROR_DFA_UITEM (-16) |
PCRE_ERROR_DFA_UITEM (-16) |
| 2594 |
|
|
| 2595 |
This return is given if pcre_dfa_exec() encounters an item in the pat- |
This return is given if pcre_dfa_exec() encounters an item in the pat- |
| 2596 |
tern that it does not support, for instance, the use of \C or a back |
tern that it does not support, for instance, the use of \C or a back |
| 2597 |
reference. |
reference. |
| 2598 |
|
|
| 2599 |
PCRE_ERROR_DFA_UCOND (-17) |
PCRE_ERROR_DFA_UCOND (-17) |
| 2600 |
|
|
| 2601 |
This return is given if pcre_dfa_exec() encounters a condition item |
This return is given if pcre_dfa_exec() encounters a condition item |
| 2602 |
that uses a back reference for the condition, or a test for recursion |
that uses a back reference for the condition, or a test for recursion |
| 2603 |
in a specific group. These are not supported. |
in a specific group. These are not supported. |
| 2604 |
|
|
| 2605 |
PCRE_ERROR_DFA_UMLIMIT (-18) |
PCRE_ERROR_DFA_UMLIMIT (-18) |
| 2606 |
|
|
| 2607 |
This return is given if pcre_dfa_exec() is called with an extra block |
This return is given if pcre_dfa_exec() is called with an extra block |
| 2608 |
that contains a setting of the match_limit field. This is not supported |
that contains a setting of the match_limit field. This is not supported |
| 2609 |
(it is meaningless). |
(it is meaningless). |
| 2610 |
|
|
| 2611 |
PCRE_ERROR_DFA_WSSIZE (-19) |
PCRE_ERROR_DFA_WSSIZE (-19) |
| 2612 |
|
|
| 2613 |
This return is given if pcre_dfa_exec() runs out of space in the |
This return is given if pcre_dfa_exec() runs out of space in the |
| 2614 |
workspace vector. |
workspace vector. |
| 2615 |
|
|
| 2616 |
PCRE_ERROR_DFA_RECURSE (-20) |
PCRE_ERROR_DFA_RECURSE (-20) |
| 2617 |
|
|
| 2618 |
When a recursive subpattern is processed, the matching function calls |
When a recursive subpattern is processed, the matching function calls |
| 2619 |
itself recursively, using private vectors for ovector and workspace. |
itself recursively, using private vectors for ovector and workspace. |
| 2620 |
This error is given if the output vector is not large enough. This |
This error is given if the output vector is not large enough. This |
| 2621 |
should be extremely rare, as a vector of size 1000 is used. |
should be extremely rare, as a vector of size 1000 is used. |
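For reporting purposes, the error codes listed above can be mapped to short messages. The following is a minimal sketch only: the enum names and the function dfa_error_text are hypothetical, the numeric values are copied from the list above, and a real program would use the PCRE_ERROR_DFA_xxx constants from pcre.h instead of redefining them.

```c
#include <assert.h>

/* Local copies of the values listed above, for illustration only.
   Real code should use the PCRE_ERROR_DFA_xxx names from pcre.h. */
enum {
    DFA_UITEM   = -16,
    DFA_UCOND   = -17,
    DFA_UMLIMIT = -18,
    DFA_WSSIZE  = -19,
    DFA_RECURSE = -20
};

/* Hypothetical helper: short description of a pcre_dfa_exec()-specific
   error. Other negative codes are shared with pcre_exec() and are not
   handled here. */
static const char *dfa_error_text(int rc)
{
    assert(rc < 0);
    switch (rc) {
    case DFA_UITEM:   return "unsupported item in pattern";
    case DFA_UCOND:   return "unsupported condition in pattern";
    case DFA_UMLIMIT: return "match_limit set in extra block";
    case DFA_WSSIZE:  return "workspace vector too small";
    case DFA_RECURSE: return "recursion output vector too small";
    default:          return "error not specific to pcre_dfa_exec()";
    }
}
```

Of these, only the workspace error is one that a caller can reasonably recover from at run time, by retrying with a larger workspace vector.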
| 2622 |
|
|
| 2623 |
|
|
| 2624 |
SEE ALSO |
SEE ALSO |
| 2625 |
|
|
| 2626 |
pcrebuild(3), pcrecallout(3), pcrecpp(3), pcrematching(3), |
pcrebuild(3), pcrecallout(3), pcrecpp(3), pcrematching(3), |
| 2627 |
pcrepartial(3), pcreposix(3), pcreprecompile(3), pcresample(3), pcrestack(3). |
pcrepartial(3), pcreposix(3), pcreprecompile(3), pcresample(3), pcrestack(3). |
| 2628 |
|
|
| 2629 |
|
|
| 2636 |
|
|
| 2637 |
REVISION |
REVISION |
| 2638 |
|
|
| 2639 |
Last updated: 17 March 2009 |
Last updated: 11 April 2009 |
| 2640 |
Copyright (c) 1997-2009 University of Cambridge. |
Copyright (c) 1997-2009 University of Cambridge. |
| 2641 |
------------------------------------------------------------------------------ |
------------------------------------------------------------------------------ |
| 2642 |
|
|
| 2985 |
The original operation of PCRE was on strings of one-byte characters. |
The original operation of PCRE was on strings of one-byte characters. |
| 2986 |
However, there is now also support for UTF-8 character strings. To use |
However, there is now also support for UTF-8 character strings. To use |
| 2987 |
this, you must build PCRE to include UTF-8 support, and then call |
this, you must build PCRE to include UTF-8 support, and then call |
| 2988 |
pcre_compile() with the PCRE_UTF8 option. How this affects pattern |
pcre_compile() with the PCRE_UTF8 option. There is also a special |
| 2989 |
matching is mentioned in several places below. There is also a summary |
sequence that can be given at the start of a pattern: |
| 2990 |
of UTF-8 features in the section on UTF-8 support in the main pcre |
|
| 2991 |
page. |
(*UTF8) |
| 2992 |
|
|
| 2993 |
|
Starting a pattern with this sequence is equivalent to setting the |
| 2994 |
|
PCRE_UTF8 option. This feature is not Perl-compatible. How setting |
| 2995 |
|
UTF-8 mode affects pattern matching is mentioned in several places |
| 2996 |
|
below. There is also a summary of UTF-8 features in the section on |
| 2997 |
|
UTF-8 support in the main pcre page. |
| 2998 |
|
|
| 2999 |
The remainder of this document discusses the patterns that are sup- |
The remainder of this document discusses the patterns that are sup- |
| 3000 |
ported by PCRE when its main matching function, pcre_exec(), is used. |
ported by PCRE when its main matching function, pcre_exec(), is used. |
| 3840 |
can be changed in the same way as the Perl-compatible options by using |
can be changed in the same way as the Perl-compatible options by using |
| 3841 |
the characters J, U and X respectively. |
the characters J, U and X respectively. |
| 3842 |
|
|
| 3843 |
When an option change occurs at top level (that is, not inside subpat- |
When one of these option changes occurs at top level (that is, not |
| 3844 |
tern parentheses), the change applies to the remainder of the pattern |
inside subpattern parentheses), the change applies to the remainder of |
| 3845 |
that follows. If the change is placed right at the start of a pattern, |
the pattern that follows. If the change is placed right at the start of |
| 3846 |
PCRE extracts it into the global options (and it will therefore show up |
a pattern, PCRE extracts it into the global options (and it will there- |
| 3847 |
in data extracted by the pcre_fullinfo() function). |
fore show up in data extracted by the pcre_fullinfo() function). |
| 3848 |
|
|
| 3849 |
An option change within a subpattern (see below for a description of |
An option change within a subpattern (see below for a description of |
| 3850 |
subpatterns) affects only that part of the current pattern that follows |
subpatterns) affects only that part of the current pattern that follows |
| 3867 |
|
|
| 3868 |
Note: There are other PCRE-specific options that can be set by the |
Note: There are other PCRE-specific options that can be set by the |
| 3869 |
application when the compile or match functions are called. In some |
application when the compile or match functions are called. In some |
| 3870 |
cases the pattern can contain special leading sequences to override |
cases the pattern can contain special leading sequences such as (*CRLF) |
| 3871 |
what the application has set or what has been defaulted. Details are |
to override what the application has set or what has been defaulted. |
| 3872 |
given in the section entitled "Newline sequences" above. |
Details are given in the section entitled "Newline sequences" above. |
| 3873 |
|
There is also the (*UTF8) leading sequence that can be used to set |
| 3874 |
|
UTF-8 mode; this is equivalent to setting the PCRE_UTF8 option. |
| 3875 |
|
|
| 3876 |
|
|
| 3877 |
SUBPATTERNS |
SUBPATTERNS |
| 5031 |
|
|
| 5032 |
REVISION |
REVISION |
| 5033 |
|
|
| 5034 |
Last updated: 18 March 2009 |
Last updated: 11 April 2009 |
| 5035 |
Copyright (c) 1997-2009 University of Cambridge. |
Copyright (c) 1997-2009 University of Cambridge. |
| 5036 |
------------------------------------------------------------------------------ |
------------------------------------------------------------------------------ |
| 5037 |
|
|
| 5144 |
SCRIPT NAMES FOR \p AND \P |
SCRIPT NAMES FOR \p AND \P |
| 5145 |
|
|
| 5146 |
Arabic, Armenian, Balinese, Bengali, Bopomofo, Braille, Buginese, |
Arabic, Armenian, Balinese, Bengali, Bopomofo, Braille, Buginese, |
| 5147 |
Buhid, Canadian_Aboriginal, Cherokee, Common, Coptic, Cuneiform, |
Buhid, Canadian_Aboriginal, Carian, Cham, Cherokee, Common, Coptic, Cu- |
| 5148 |
Cypriot, Cyrillic, Deseret, Devanagari, Ethiopic, Georgian, Glagolitic, |
neiform, Cypriot, Cyrillic, Deseret, Devanagari, Ethiopic, Georgian, |
| 5149 |
Gothic, Greek, Gujarati, Gurmukhi, Han, Hangul, Hanunoo, Hebrew, Hira- |
Glagolitic, Gothic, Greek, Gujarati, Gurmukhi, Han, Hangul, Hanunoo, |
| 5150 |
gana, Inherited, Kannada, Katakana, Kharoshthi, Khmer, Lao, Latin, |
Hebrew, Hiragana, Inherited, Kannada, Katakana, Kayah_Li, Kharoshthi, |
| 5151 |
Limbu, Linear_B, Malayalam, Mongolian, Myanmar, New_Tai_Lue, Nko, |
Khmer, Lao, Latin, Lepcha, Limbu, Linear_B, Lycian, Lydian, Malayalam, |
| 5152 |
Ogham, Old_Italic, Old_Persian, Oriya, Osmanya, Phags_Pa, Phoenician, |
Mongolian, Myanmar, New_Tai_Lue, Nko, Ogham, Old_Italic, Old_Persian, |
| 5153 |
Runic, Shavian, Sinhala, Syloti_Nagri, Syriac, Tagalog, Tagbanwa, |
Ol_Chiki, Oriya, Osmanya, Phags_Pa, Phoenician, Rejang, Runic, |
| 5154 |
Tai_Le, Tamil, Telugu, Thaana, Thai, Tibetan, Tifinagh, Ugaritic, Yi. |
Saurashtra, Shavian, Sinhala, Sundanese, Syloti_Nagri, Syriac, Tagalog, |
| 5155 |
|
Tagbanwa, Tai_Le, Tamil, Telugu, Thaana, Thai, Tibetan, Tifinagh, |
| 5156 |
|
Ugaritic, Vai, Yi. |
| 5157 |
|
|
| 5158 |
|
|
| 5159 |
CHARACTER CLASSES |
CHARACTER CLASSES |
| 5205 |
|
|
| 5206 |
ANCHORS AND SIMPLE ASSERTIONS |
ANCHORS AND SIMPLE ASSERTIONS |
| 5207 |
|
|
| 5208 |
\b word boundary |
\b word boundary (only ASCII letters recognized) |
| 5209 |
\B not a word boundary |
\B not a word boundary |
| 5210 |
^ start of subject |
^ start of subject |
| 5211 |
also after internal newline in multiline mode |
also after internal newline in multiline mode |
| 5231 |
|
|
| 5232 |
CAPTURING |
CAPTURING |
| 5233 |
|
|
| 5234 |
(...) capturing group |
(...) capturing group |
| 5235 |
(?<name>...) named capturing group (Perl) |
(?<name>...) named capturing group (Perl) |
| 5236 |
(?'name'...) named capturing group (Perl) |
(?'name'...) named capturing group (Perl) |
| 5237 |
(?P<name>...) named capturing group (Python) |
(?P<name>...) named capturing group (Python) |
| 5238 |
(?:...) non-capturing group |
(?:...) non-capturing group |
| 5239 |
(?|...) non-capturing group; reset group numbers for |
(?|...) non-capturing group; reset group numbers for |
| 5240 |
capturing groups in each alternative |
capturing groups in each alternative |
| 5241 |
|
|
| 5242 |
|
|
| 5243 |
ATOMIC GROUPS |
ATOMIC GROUPS |
| 5244 |
|
|
| 5245 |
(?>...) atomic, non-capturing group |
(?>...) atomic, non-capturing group |
| 5246 |
|
|
| 5247 |
|
|
| 5248 |
COMMENT |
COMMENT |
| 5249 |
|
|
| 5250 |
(?#....) comment (not nestable) |
(?#....) comment (not nestable) |
| 5251 |
|
|
| 5252 |
|
|
| 5253 |
OPTION SETTING |
OPTION SETTING |
| 5254 |
|
|
| 5255 |
(?i) caseless |
(?i) caseless |
| 5256 |
(?J) allow duplicate names |
(?J) allow duplicate names |
| 5257 |
(?m) multiline |
(?m) multiline |
| 5258 |
(?s) single line (dotall) |
(?s) single line (dotall) |
| 5259 |
(?U) default ungreedy (lazy) |
(?U) default ungreedy (lazy) |
| 5260 |
(?x) extended (ignore white space) |
(?x) extended (ignore white space) |
| 5261 |
(?-...) unset option(s) |
(?-...) unset option(s) |
| 5262 |
|
|
| 5263 |
|
The following is recognized only at the start of a pattern or after one |
| 5264 |
|
of the newline-setting options with similar syntax: |
| 5265 |
|
|
| 5266 |
|
(*UTF8) set UTF-8 mode |
| 5267 |
|
|
| 5268 |
|
|
| 5269 |
LOOKAHEAD AND LOOKBEHIND ASSERTIONS |
LOOKAHEAD AND LOOKBEHIND ASSERTIONS |
| 5270 |
|
|
| 5271 |
(?=...) positive look ahead |
(?=...) positive look ahead |
| 5272 |
(?!...) negative look ahead |
(?!...) negative look ahead |
| 5273 |
(?<=...) positive look behind |
(?<=...) positive look behind |
| 5274 |
(?<!...) negative look behind |
(?<!...) negative look behind |
| 5275 |
|
|
| 5276 |
Each top-level branch of a look behind must be of a fixed length. |
Each top-level branch of a look behind must be of a fixed length. |
| 5277 |
|
|
| 5278 |
|
|
| 5279 |
BACKREFERENCES |
BACKREFERENCES |
| 5280 |
|
|
| 5281 |
\n reference by number (can be ambiguous) |
\n reference by number (can be ambiguous) |
| 5282 |
\gn reference by number |
\gn reference by number |
| 5283 |
\g{n} reference by number |
\g{n} reference by number |
| 5284 |
\g{-n} relative reference by number |
\g{-n} relative reference by number |
| 5285 |
\k<name> reference by name (Perl) |
\k<name> reference by name (Perl) |
| 5286 |
\k'name' reference by name (Perl) |
\k'name' reference by name (Perl) |
| 5287 |
\g{name} reference by name (Perl) |
\g{name} reference by name (Perl) |
| 5288 |
\k{name} reference by name (.NET) |
\k{name} reference by name (.NET) |
| 5289 |
(?P=name) reference by name (Python) |
(?P=name) reference by name (Python) |
| 5290 |
|
|
| 5291 |
|
|
| 5292 |
SUBROUTINE REFERENCES (POSSIBLY RECURSIVE) |
SUBROUTINE REFERENCES (POSSIBLY RECURSIVE) |
| 5293 |
|
|
| 5294 |
(?R) recurse whole pattern |
(?R) recurse whole pattern |
| 5295 |
(?n) call subpattern by absolute number |
(?n) call subpattern by absolute number |
| 5296 |
(?+n) call subpattern by relative number |
(?+n) call subpattern by relative number |
| 5297 |
(?-n) call subpattern by relative number |
(?-n) call subpattern by relative number |
| 5298 |
(?&name) call subpattern by name (Perl) |
(?&name) call subpattern by name (Perl) |
| 5299 |
(?P>name) call subpattern by name (Python) |
(?P>name) call subpattern by name (Python) |
| 5300 |
\g<name> call subpattern by name (Oniguruma) |
\g<name> call subpattern by name (Oniguruma) |
| 5301 |
\g'name' call subpattern by name (Oniguruma) |
\g'name' call subpattern by name (Oniguruma) |
| 5302 |
\g<n> call subpattern by absolute number (Oniguruma) |
\g<n> call subpattern by absolute number (Oniguruma) |
| 5303 |
\g'n' call subpattern by absolute number (Oniguruma) |
\g'n' call subpattern by absolute number (Oniguruma) |
| 5304 |
\g<+n> call subpattern by relative number (PCRE extension) |
\g<+n> call subpattern by relative number (PCRE extension) |
| 5305 |
\g'+n' call subpattern by relative number (PCRE extension) |
\g'+n' call subpattern by relative number (PCRE extension) |
| 5306 |
\g<-n> call subpattern by relative number (PCRE extension) |
\g<-n> call subpattern by relative number (PCRE extension) |
| 5307 |
\g'-n' call subpattern by relative number (PCRE extension) |
\g'-n' call subpattern by relative number (PCRE extension) |
| 5308 |
|
|
| 5309 |
|
|
| 5310 |
CONDITIONAL PATTERNS |
CONDITIONAL PATTERNS |
| 5312 |
(?(condition)yes-pattern) |
(?(condition)yes-pattern) |
| 5313 |
(?(condition)yes-pattern|no-pattern) |
(?(condition)yes-pattern|no-pattern) |
| 5314 |
|
|
| 5315 |
(?(n)... absolute reference condition |
(?(n)... absolute reference condition |
| 5316 |
(?(+n)... relative reference condition |
(?(+n)... relative reference condition |
| 5317 |
(?(-n)... relative reference condition |
(?(-n)... relative reference condition |
| 5318 |
(?(<name>)... named reference condition (Perl) |
(?(<name>)... named reference condition (Perl) |
| 5319 |
(?('name')... named reference condition (Perl) |
(?('name')... named reference condition (Perl) |
| 5320 |
(?(name)... named reference condition (PCRE) |
(?(name)... named reference condition (PCRE) |
| 5321 |
(?(R)... overall recursion condition |
(?(R)... overall recursion condition |
| 5322 |
(?(Rn)... specific group recursion condition |
(?(Rn)... specific group recursion condition |
| 5323 |
(?(R&name)... specific recursion condition |
(?(R&name)... specific recursion condition |
| 5324 |
(?(DEFINE)... define subpattern for reference |
(?(DEFINE)... define subpattern for reference |
| 5325 |
(?(assert)... assertion condition |
(?(assert)... assertion condition |
| 5326 |
|
|
| 5327 |
|
|
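The absolute reference condition (?(n)... can be sketched as follows. The example again relies on Python's re module, which supports the (?(n)yes-pattern|no-pattern) and (?(name)...) forms but, as far as this sketch assumes, none of the recursion, DEFINE, or assertion conditions listed above. The pattern and test strings are hypothetical.

```python
import re

# Group 1 optionally matches "(". The condition (?(1)\)) then demands ")"
# only if group 1 took part in the match; otherwise it matches nothing.
balanced = re.compile(r"^(\()?\d+(?(1)\))$")

results = {s: bool(balanced.match(s)) for s in ["123", "(123)", "(123", "123)"]}
```

Here the no-pattern is omitted, so an unset group 1 leaves the condition matching the empty string, and only the bare and the fully parenthesized numbers are accepted.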
| 5328 |
BACKTRACKING CONTROL |
BACKTRACKING CONTROL |
| 5329 |
|
|
| 5330 |
The following act immediately they are reached: |
The following act immediately they are reached: |
| 5331 |
|
|
| 5332 |
(*ACCEPT) force successful match |
(*ACCEPT) force successful match |
| 5333 |
(*FAIL) force backtrack; synonym (*F) |
(*FAIL) force backtrack; synonym (*F) |
| 5334 |
|
|
| 5335 |
The following act only when a subsequent match failure causes a back- |
The following act only when a subsequent match failure causes a back- |
| 5336 |
track to reach them. They all force a match failure, but they differ in |
track to reach them. They all force a match failure, but they differ in |
| 5337 |
what happens afterwards. Those that advance the start-of-match point do |
what happens afterwards. Those that advance the start-of-match point do |
| 5338 |
so only if the pattern is not anchored. |
so only if the pattern is not anchored. |
| 5339 |
|
|
| 5340 |
(*COMMIT) overall failure, no advance of starting point |
(*COMMIT) overall failure, no advance of starting point |
| 5341 |
(*PRUNE) advance to next starting character |
(*PRUNE) advance to next starting character |
| 5342 |
(*SKIP) advance start to current matching position |
(*SKIP) advance start to current matching position |
| 5343 |
(*THEN) local failure, backtrack to next alternation |
(*THEN) local failure, backtrack to next alternation |
| 5344 |
|
|
| 5345 |
|
|
| 5346 |
NEWLINE CONVENTIONS |
NEWLINE CONVENTIONS |
| 5347 |
|
|
| 5348 |
These are recognized only at the very start of the pattern or after a |
These are recognized only at the very start of the pattern or after a |
| 5349 |
(*BSR_...) option. |
(*BSR_...) or (*UTF8) option. |
| 5350 |
|
|
| 5351 |
(*CR) |
(*CR) carriage return only |
| 5352 |
(*LF) |
(*LF) linefeed only |
| 5353 |
(*CRLF) |
(*CRLF) carriage return followed by linefeed |
| 5354 |
(*ANYCRLF) |
(*ANYCRLF) all three of the above |
| 5355 |
(*ANY) |
(*ANY) any Unicode newline sequence |
| 5356 |
|
|
| 5357 |
|
|
| 5358 |
WHAT \R MATCHES |
WHAT \R MATCHES |
| 5359 |
|
|
| 5360 |
These are recognized only at the very start of the pattern or after a |
These are recognized only at the very start of the pattern or after a |
| 5361 |
(*...) option that sets the newline convention. |
(*...) option that sets the newline convention or UTF-8 mode. |
| 5362 |
|
|
| 5363 |
(*BSR_ANYCRLF) |
(*BSR_ANYCRLF) CR, LF, or CRLF |
| 5364 |
(*BSR_UNICODE) |
(*BSR_UNICODE) any Unicode newline sequence |
| 5365 |
|
|
| 5366 |
|
|
| 5367 |
CALLOUTS |
CALLOUTS |
| 5384 |
|
|
| 5385 |
REVISION |
REVISION |
| 5386 |
|
|
| 5387 |
Last updated: 09 April 2008 |
Last updated: 11 April 2009 |
| 5388 |
Copyright (c) 1997-2008 University of Cambridge. |
Copyright (c) 1997-2009 University of Cambridge. |
| 5389 |
------------------------------------------------------------------------------ |
------------------------------------------------------------------------------ |
| 5390 |
|
|
| 5391 |
|
|