There are a number of optimizations that pcre_exec() uses at the start
of a match, in order to speed up the process. For example, if it is
known that an unanchored match must start with a specific character, it
searches the subject for that character, and fails immediately if it
cannot find it, without actually running the main matching function.
This means that a special item such as (*COMMIT) at the start of a
pattern is not considered until after a suitable starting point for the
match has been found. When callouts are in use, these "start-up"
optimizations can cause them to be skipped if the pattern is never
actually used. The PCRE_NO_START_OPTIMIZE option disables the start-up
optimizations, causing performance to suffer, but ensuring that the
callouts do occur, and that items such as (*COMMIT) are considered at
every possible starting position in the subject string.
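The first-character search described above can be pictured with a small
sketch. This is not PCRE's internal code; find_start_candidate() is a
hypothetical helper that only illustrates the idea of scanning for a
required first character before any real matching is attempted.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of PCRE. Returns the first offset at or
   after start where first_char occurs in subject, or -1 if it does not
   occur. On -1 a matcher could report "no match" without ever running
   its main matching function (and so without reaching any callout). */
static long find_start_candidate(const char *subject, size_t length,
                                 size_t start, char first_char)
{
    const char *hit;

    if (start >= length)
        return -1;
    /* memchr works on a byte count, so binary zeros in the subject
       do not end the search early. */
    hit = memchr(subject + start, first_char, length - start);
    return hit ? (long)(hit - subject) : -1;
}
```

With the subject "xxabc" and required character 'a', the sketch yields
offset 2; with the subject "xyz" it yields -1 and nothing else would be
run.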

PCRE_NO_UTF8_CHECK

When PCRE_UTF8 is set at compile time, the validity of the subject as a
UTF-8 string is automatically checked when pcre_exec() is subsequently
called. The value of startoffset is also checked to ensure that it
points to the start of a UTF-8 character. There is a discussion about
the validity of UTF-8 strings in the section on UTF-8 support in the
main pcre page. If an invalid UTF-8 sequence of bytes is found,
pcre_exec() returns the error PCRE_ERROR_BADUTF8. If startoffset
contains an invalid value, PCRE_ERROR_BADUTF8_OFFSET is returned.

If you already know that your subject is valid, and you want to skip
these checks for performance reasons, you can set the
PCRE_NO_UTF8_CHECK option when calling pcre_exec(). You might want to
do this for the second and subsequent calls to pcre_exec() if you are
making repeated calls to find all the matches in a single subject
string. However, you should be sure that the value of startoffset
points to the start of a UTF-8 character. When PCRE_NO_UTF8_CHECK is
set, the effect of passing an invalid UTF-8 string as a subject, or a
value of startoffset that does not point to the start of a UTF-8
character, is undefined. Your program may crash.
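If the checks are skipped, the caller has to guarantee the startoffset
condition itself. The following is a minimal sketch of such a test,
assuming the subject is already known to be valid UTF-8. The function
name is_utf8_char_start() is hypothetical, and treating an offset equal
to the length as acceptable is an assumption of this sketch.

```c
#include <stddef.h>

/* Hypothetical helper, not part of PCRE. Returns 1 if offset can be the
   start of a character in a subject that is already known to be valid
   UTF-8, and 0 otherwise. It does NOT validate the string as a whole. */
static int is_utf8_char_start(const unsigned char *subject, size_t length,
                              size_t offset)
{
    if (offset >= length)
        return offset == length;   /* assumption: end of subject is allowed */
    /* Bytes of the form 10xxxxxx are continuation bytes, which can only
       appear inside a multi-byte character. */
    return (subject[offset] & 0xC0) != 0x80;
}
```

For the four bytes 'a', 0xC3, 0xA9, 'b' (the letter a, an e with acute
accent, and the letter b), offsets 0, 1 and 3 are character starts and
offset 2 is not.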

PCRE_PARTIAL_HARD
PCRE_PARTIAL_SOFT

These options turn on the partial matching feature. For backwards
compatibility, PCRE_PARTIAL is a synonym for PCRE_PARTIAL_SOFT. A partial
match occurs if the end of the subject string is reached successfully,
but there are not enough subject characters to complete the match. If
this happens when PCRE_PARTIAL_HARD is set, pcre_exec() immediately
returns PCRE_ERROR_PARTIAL. Otherwise, if PCRE_PARTIAL_SOFT is set,
matching continues by testing any other alternatives. Only if they all
fail is PCRE_ERROR_PARTIAL returned (instead of PCRE_ERROR_NOMATCH).
The portion of the string that was inspected when the partial match was
found is set as the first matching string. There is a more detailed
discussion in the pcrepartial documentation.

The string to be matched by pcre_exec()

The subject string is passed to pcre_exec() as a pointer in subject, a
length (in bytes) in length, and a starting byte offset in startoffset.
In UTF-8 mode, the byte offset must point to the start of a UTF-8
character. Unlike the pattern string, the subject may contain binary
zero bytes. When the starting offset is zero, the search for a match
starts at the beginning of the subject, and this is by far the most
common case.

A non-zero starting offset is useful when searching for another match
in the same subject by calling pcre_exec() again after a previous
success. Setting startoffset differs from just passing over a shortened
string and setting PCRE_NOTBOL in the case of a pattern that begins
with any kind of lookbehind. For example, consider the pattern

  \Biss\B

which finds occurrences of "iss" in the middle of words. (\B matches
only if the current position in the subject is not a word boundary.)
When applied to the string "Mississippi" the first call to pcre_exec()
finds the first occurrence. If pcre_exec() is called again with just
the remainder of the subject, namely "issippi", it does not match,
because \B is always false at the start of the subject, which is deemed
to be a word boundary. However, if pcre_exec() is passed the entire
string again, but with startoffset set to 4, it finds the second
occurrence of "iss" because it is able to look behind the starting point
to discover that it is preceded by a letter.
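The difference between the two calls comes down to what lies before the
position being tested. The sketch below is not PCRE code; it is a
hypothetical, ASCII-only model of the \B test, assuming that "word"
characters are letters, digits and underscore and that anything outside
the subject counts as a non-word character.

```c
#include <ctype.h>
#include <stddef.h>

/* ASCII-only notion of a "word" character (an assumption of this sketch). */
static int is_word_char(unsigned char c)
{
    return isalnum(c) || c == '_';
}

/* Hypothetical model of \B: returns 1 if position pos in subject is NOT
   a word boundary. Positions before the start and after the end of the
   subject are treated as non-word characters. */
static int not_word_boundary(const char *subject, size_t length, size_t pos)
{
    int before = pos > 0 && is_word_char((unsigned char)subject[pos - 1]);
    int after = pos < length && is_word_char((unsigned char)subject[pos]);

    return before == after;
}
```

Here not_word_boundary("issippi", 7, 0) is 0, because nothing precedes
the first character, whereas not_word_boundary("Mississippi", 11, 4) is
1, because the character at offset 3 is a letter.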

If a non-zero starting offset is passed when the pattern is anchored,
one attempt to match at the given offset is made. This can only succeed
if the pattern does not require the match to be at the start of the
subject.

How pcre_exec() returns captured substrings

In general, a pattern matches a certain portion of the subject, and in
addition, further substrings from the subject may be picked out by
parts of the pattern. Following the usage in Jeffrey Friedl's book,
this is called "capturing" in what follows, and the phrase "capturing
subpattern" is used for a fragment of a pattern that picks out a
substring. PCRE supports several other kinds of parenthesized subpattern
that do not cause substrings to be captured.

Captured substrings are returned to the caller via a vector of integers
whose address is passed in ovector. The number of elements in the
vector is passed in ovecsize, which must be a non-negative number. Note:
this argument is NOT the size of ovector in bytes.

The first two-thirds of the vector is used to pass back captured
substrings, each substring using a pair of integers. The remaining third
of the vector is used as workspace by pcre_exec() while matching
capturing subpatterns, and is not available for passing back information.
The number passed in ovecsize should always be a multiple of three. If
it is not, it is rounded down.

When a match is successful, information about captured substrings is
returned in pairs of integers, starting at the beginning of ovector,
and continuing up to two-thirds of its length at the most. The first
element of each pair is set to the byte offset of the first character
in a substring, and the second is set to the byte offset of the first
character after the end of a substring. Note: these values are always
byte offsets, even in UTF-8 mode. They are not character counts.

The first pair of integers, ovector[0] and ovector[1], identify the
portion of the subject string matched by the entire pattern. The next
pair is used for the first capturing subpattern, and so on. The value
returned by pcre_exec() is one more than the highest numbered pair that
has been set. For example, if two substrings have been captured, the
returned value is 3. If there are no capturing subpatterns, the return
value from a successful match is 1, indicating that just the first pair
of offsets has been set.
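Reading a substring out of the offset pairs can be sketched as follows.
The function copy_capture() is a hypothetical helper, not a PCRE
function, and it assumes rc is a positive value returned by a successful
pcre_exec() call on the same subject and ovector.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of PCRE. Copies the substring for pair
   number n (0 = whole match, 1 = first capturing subpattern, ...) into
   buf as a zero-terminated string. Returns the length in bytes, or -1
   if pair n was not set or buf is too small. */
static int copy_capture(const char *subject, const int *ovector, int rc,
                        int n, char *buf, size_t bufsize)
{
    int start, end;
    size_t len;

    if (n < 0 || n >= rc)
        return -1;                 /* rc is one more than the highest pair set */
    start = ovector[2 * n];        /* byte offset of the first character */
    end = ovector[2 * n + 1];      /* byte offset just past the last one */
    if (start < 0 || end < start)
        return -1;                 /* unset subpattern */
    len = (size_t)(end - start);
    if (len + 1 > bufsize)
        return -1;
    memcpy(buf, subject + start, len);
    buf[len] = '\0';
    return (int)len;
}
```

For the subject "key=value" and an ovector holding 0,9, 0,3, 4,9 with a
return value of 3, pair 0 is the whole string, pair 1 is "key" and pair
2 is "value".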

If a capturing subpattern is matched repeatedly, it is the last portion
of the string that it matched that is returned.

If the vector is too small to hold all the captured substring offsets,
it is used as far as possible (up to two-thirds of its length), and the
function returns a value of zero. If the substring offsets are not of
interest, pcre_exec() may be called with ovector passed as NULL and
ovecsize as zero. However, if the pattern contains back references and
the ovector is not big enough to remember the related substrings, PCRE
has to get additional memory for use during matching. Thus it is
usually advisable to supply an ovector.

The pcre_fullinfo() function can be used to find out how many capturing
subpatterns there are in a compiled pattern. The smallest size for
ovector that will allow for n captured substrings, in addition to the
offsets of the substring matched by the whole pattern, is (n+1)*3.
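The sizing rule is simple arithmetic, shown here as a sketch with two
hypothetical helpers. The capture count is assumed to have been obtained
already, for example from pcre_fullinfo() with PCRE_INFO_CAPTURECOUNT.

```c
/* Hypothetical helpers, not part of PCRE. */

/* Smallest ovecsize that holds the whole match plus capture_count
   captured substrings: (n+1)*3 integers. */
static int min_ovecsize(int capture_count)
{
    return (capture_count + 1) * 3;
}

/* Number of offset pairs that can be passed back for a given ovecsize.
   Only two-thirds of the vector carries results, and a size that is not
   a multiple of three is rounded down, so this is ovecsize / 3. */
static int usable_pairs(int ovecsize)
{
    return ovecsize / 3;
}
```

A pattern with two capturing subpatterns therefore needs at least nine
integers, and a vector of ten integers is no more useful than one of
nine.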

It is possible for capturing subpattern number n+1 to match some part
of the subject when subpattern n has not been used at all. For example,
if the string "abc" is matched against the pattern (a|(z))(bc) the
return from the function is 4, and subpatterns 1 and 3 are matched, but
2 is not. When this happens, both values in the offset pairs
corresponding to unused subpatterns are set to -1.
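A caller can therefore test a pair for -1 before using it. The sketch
below uses hypothetical helper names and an ovector filled in by hand
with the offsets expected for "abc" matched against (a|(z))(bc); no PCRE
call is made.

```c
/* Hypothetical helpers, not part of PCRE. */

/* Returns 1 if offset pair n holds real offsets, 0 if it is unset (-1). */
static int capture_is_set(const int *ovector, int n)
{
    return ovector[2 * n] >= 0 && ovector[2 * n + 1] >= 0;
}

/* Counts the pairs below rc that are actually set, including pair 0. */
static int count_set_pairs(const int *ovector, int rc)
{
    int n, count = 0;

    for (n = 0; n < rc; n++)
        if (capture_is_set(ovector, n))
            count++;
    return count;
}
```

With the offsets 0,3, 0,1, -1,-1, 1,3 and a return value of 4, pairs 0,
1 and 3 are set and pair 2 is not, so three of the four pairs are
usable.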

Offset values that correspond to unused subpatterns at the end of the
expression are also set to -1. For example, if the string "abc" is
matched against the pattern (abc)(x(yz)?)? subpatterns 2 and 3 are not
matched. The return from the function is 2, because the highest used
capturing subpattern number is 1. However, you can refer to the offsets
for the second and third capturing subpatterns if you wish (assuming
the vector is large enough, of course).

Some convenience functions are provided for extracting the captured
substrings as separate strings. These are described below.

Error return values from pcre_exec()

If pcre_exec() fails, it returns a negative number. The following are
defined in the header file:

PCRE_ERROR_NOMATCH (-1)

PCRE_ERROR_NULL (-2)

Either code or subject was passed as NULL, or ovector was NULL and
ovecsize was not zero.

PCRE_ERROR_BADOPTION (-3)

PCRE_ERROR_BADMAGIC (-4)

PCRE stores a 4-byte "magic number" at the start of the compiled code,
to catch the case when it is passed a junk pointer and to detect when a
pattern that was compiled in an environment of one endianness is run in
an environment with the other endianness. This is the error that PCRE
gives when the magic number is not present.

PCRE_ERROR_UNKNOWN_OPCODE (-5)

While running the pattern match, an unknown item was encountered in the
compiled pattern. This error could be caused by a bug in PCRE or by
overwriting of the compiled pattern.

PCRE_ERROR_NOMEMORY (-6)

If a pattern contains back references, but the ovector that is passed
to pcre_exec() is not big enough to remember the referenced substrings,
PCRE gets a block of memory at the start of matching to use for this
purpose. If the call via pcre_malloc() fails, this error is given. The
memory is automatically freed at the end of matching.

This error is also given if pcre_stack_malloc() fails in pcre_exec().
This can happen only when PCRE has been compiled with
--disable-stack-for-recursion.

PCRE_ERROR_NOSUBSTRING (-7)

This error is used by the pcre_copy_substring(), pcre_get_substring(),
and pcre_get_substring_list() functions (see below). It is never
returned by pcre_exec().

PCRE_ERROR_MATCHLIMIT (-8)

The backtracking limit, as specified by the match_limit field in a
pcre_extra structure (or defaulted) was reached. See the description
above.

PCRE_ERROR_CALLOUT (-9)

This error is never generated by pcre_exec() itself. It is provided for
use by callout functions that want to yield a distinctive error code.
See the pcrecallout documentation for details.

PCRE_ERROR_BADUTF8 (-10)

A string that contains an invalid UTF-8 byte sequence was passed as a
subject.

PCRE_ERROR_BADUTF8_OFFSET (-11)

The UTF-8 byte sequence that was passed as a subject was valid, but the
value of startoffset did not point to the beginning of a UTF-8
character.

PCRE_ERROR_PARTIAL (-12)

The subject string did not match, but it did match partially. See the
pcrepartial documentation for details of partial matching.

PCRE_ERROR_BADPARTIAL (-13)

This code is no longer in use. It was formerly returned when the
PCRE_PARTIAL option was used with a compiled pattern containing items
that were not supported for partial matching. From release 8.00
onwards, there are no restrictions on partial matching.

PCRE_ERROR_INTERNAL (-14)

An unexpected internal error has occurred. This error could be caused
by a bug in PCRE or by overwriting of the compiled pattern.

PCRE_ERROR_BADCOUNT (-15)

PCRE_ERROR_RECURSIONLIMIT (-21)

The internal recursion limit, as specified by the match_limit_recursion
field in a pcre_extra structure (or defaulted) was reached. See the
description above.

PCRE_ERROR_BADNEWLINE (-23)

int pcre_get_substring_list(const char *subject,
     int *ovector, int stringcount, const char ***listptr);

Captured substrings can be accessed directly by using the offsets
returned by pcre_exec() in ovector. For convenience, the functions
pcre_copy_substring(), pcre_get_substring(), and
pcre_get_substring_list() are provided for extracting captured substrings
as new, separate, zero-terminated strings. These functions identify
substrings by number. The next section describes functions for
extracting named substrings.

A substring that contains a binary zero is correctly extracted and has
a further zero added on the end, but the result is not, of course, a C
string. However, you can process such a string by referring to the
length that is returned by pcre_copy_substring() and
pcre_get_substring(). Unfortunately, the interface to
pcre_get_substring_list() is not adequate for handling strings
containing binary zeros, because the end of the final string is not
independently indicated.

The first three arguments are the same for all three of these
functions: subject is the subject string that has just been successfully
matched, ovector is a pointer to the vector of integer offsets that was
passed to pcre_exec(), and stringcount is the number of substrings that
were captured by the match, including the substring that matched the
entire regular expression. This is the value returned by pcre_exec() if
it is greater than zero. If pcre_exec() returned zero, indicating that
it ran out of space in ovector, the value passed as stringcount should
be the number of elements in the vector divided by three.
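The rule for choosing stringcount can be written down as a small
sketch. The helper name stringcount_from_result() is hypothetical; rc is
assumed to be the value returned by pcre_exec() and ovecsize the value
that was passed to it.

```c
/* Hypothetical helper, not part of PCRE. Derives the stringcount
   argument for the substring extraction functions from the result of
   pcre_exec(). Returns -1 if the match failed, in which case there are
   no substrings to extract. */
static int stringcount_from_result(int rc, int ovecsize)
{
    if (rc > 0)
        return rc;              /* normal case: use the return value */
    if (rc == 0)
        return ovecsize / 3;    /* ovector was too small: use what fitted */
    return -1;                  /* negative values are error codes */
}
```

For example, a return value of 0 with a 30-element ovector gives a
stringcount of 10.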
| 2427 |
|
|
| 2428 |
The functions pcre_copy_substring() and pcre_get_substring() extract a |
The functions pcre_copy_substring() and pcre_get_substring() extract a |
| 2429 |
single substring, whose number is given as stringnumber. A value of |
single substring, whose number is given as stringnumber. A value of |
| 2430 |
zero extracts the substring that matched the entire pattern, whereas |
zero extracts the substring that matched the entire pattern, whereas |
| 2431 |
higher values extract the captured substrings. For pcre_copy_sub- |
higher values extract the captured substrings. For pcre_copy_sub- |
| 2432 |
string(), the string is placed in buffer, whose length is given by |
string(), the string is placed in buffer, whose length is given by |
| 2433 |
buffersize, while for pcre_get_substring() a new block of memory is |
buffersize, while for pcre_get_substring() a new block of memory is |
| 2434 |
obtained via pcre_malloc, and its address is returned via stringptr. |
obtained via pcre_malloc, and its address is returned via stringptr. |
| 2435 |
The yield of the function is the length of the string, not including |
The yield of the function is the length of the string, not including |
| 2436 |
the terminating zero, or one of these error codes: |
the terminating zero, or one of these error codes: |
| 2437 |
|
|
| 2438 |
PCRE_ERROR_NOMEMORY (-6) |
PCRE_ERROR_NOMEMORY (-6) |
| 2439 |
|
|
| 2440 |
The buffer was too small for pcre_copy_substring(), or the attempt to |
The buffer was too small for pcre_copy_substring(), or the attempt to |
| 2441 |
get memory failed for pcre_get_substring(). |
get memory failed for pcre_get_substring(). |
| 2442 |
|
|
| 2443 |
PCRE_ERROR_NOSUBSTRING (-7) |
PCRE_ERROR_NOSUBSTRING (-7) |
| 2444 |
|
|
| 2445 |
There is no substring whose number is stringnumber. |
There is no substring whose number is stringnumber. |
| 2446 |
|
|
| 2447 |
The pcre_get_substring_list() function extracts all available sub- |
The pcre_get_substring_list() function extracts all available sub- |
| 2448 |
strings and builds a list of pointers to them. All this is done in a |
strings and builds a list of pointers to them. All this is done in a |
| 2449 |
single block of memory that is obtained via pcre_malloc. The address of |
single block of memory that is obtained via pcre_malloc. The address of |
| 2450 |
the memory block is returned via listptr, which is also the start of |
the memory block is returned via listptr, which is also the start of |
| 2451 |
the list of string pointers. The end of the list is marked by a NULL |
the list of string pointers. The end of the list is marked by a NULL |
| 2452 |
pointer. The yield of the function is zero if all went well, or the |
pointer. The yield of the function is zero if all went well, or the |
| 2453 |
error code |
error code |
| 2454 |
|
|
| 2455 |
PCRE_ERROR_NOMEMORY (-6) |
PCRE_ERROR_NOMEMORY (-6) |
| 2456 |
|
|
| 2457 |
if the attempt to get the memory block failed. |
if the attempt to get the memory block failed. |
| 2458 |
|
|
| 2459 |
When any of these functions encounter a substring that is unset, which |
When any of these functions encounter a substring that is unset, which |
| 2460 |
can happen when capturing subpattern number n+1 matches some part of |
can happen when capturing subpattern number n+1 matches some part of |
| 2461 |
the subject, but subpattern n has not been used at all, they return an |
the subject, but subpattern n has not been used at all, they return an |
| 2462 |
empty string. This can be distinguished from a genuine zero-length sub- |
empty string. This can be distinguished from a genuine zero-length sub- |
| 2463 |
string by inspecting the appropriate offset in ovector, which is nega- |
string by inspecting the appropriate offset in ovector, which is nega- |
| 2464 |
tive for unset substrings. |
tive for unset substrings. |
| 2465 |
|
|
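As a minimal sketch of the check described above, the helper below tells an unset substring apart from a genuine zero-length one by looking at its start offset in ovector. The name sketch_substring_is_unset is hypothetical, and the caller is assumed to pass a substring number below the count returned by the matching call.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if captured substring n is unset, 0 if it is set
   (a set substring may still be empty, with start equal to end). */
static int sketch_substring_is_unset(const int *ovector, int n)
{
    assert(ovector != NULL && n >= 0);
    /* The start offset of pair n is negative for unset substrings. */
    return ovector[2 * n] < 0;
}
```

For example, with offsets {0, 1, -1, -1, 0, 1} substring 1 is reported as unset, whereas with offsets {0, 1, 0, 0} substring 1 is set but has zero length.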
| 2466 |
The two convenience functions pcre_free_substring() and pcre_free_sub- |
The two convenience functions pcre_free_substring() and pcre_free_sub- |
| 2467 |
string_list() can be used to free the memory returned by a previous |
string_list() can be used to free the memory returned by a previous |
| 2468 |
call of pcre_get_substring() or pcre_get_substring_list(), respec- |
call of pcre_get_substring() or pcre_get_substring_list(), respec- |
| 2469 |
tively. They do nothing more than call the function pointed to by |
tively. They do nothing more than call the function pointed to by |
| 2470 |
pcre_free, which of course could be called directly from a C program. |
pcre_free, which of course could be called directly from a C program. |
| 2471 |
However, PCRE is used in some situations where it is linked via a spe- |
However, PCRE is used in some situations where it is linked via a spe- |
| 2472 |
cial interface to another programming language that cannot use |
cial interface to another programming language that cannot use |
| 2473 |
pcre_free directly; it is for these cases that the functions are pro- |
pcre_free directly; it is for these cases that the functions are pro- |
| 2474 |
vided. |
vided. |
| 2475 |
|
|
| 2476 |
|
|
| 2489 |
int stringcount, const char *stringname, |
int stringcount, const char *stringname, |
| 2490 |
const char **stringptr); |
const char **stringptr); |
| 2491 |
|
|
| 2492 |
To extract a substring by name, you first have to find the associated |
To extract a substring by name, you first have to find the associated |
| 2493 |
number. For example, for this pattern |
number. For example, for this pattern |
| 2494 |
|
|
| 2495 |
(a+)b(?<xxx>\d+)... |
(a+)b(?<xxx>\d+)... |
| 2498 |
be unique (PCRE_DUPNAMES was not set), you can find the number from the |
be unique (PCRE_DUPNAMES was not set), you can find the number from the |
| 2499 |
name by calling pcre_get_stringnumber(). The first argument is the com- |
name by calling pcre_get_stringnumber(). The first argument is the com- |
| 2500 |
piled pattern, and the second is the name. The yield of the function is |
piled pattern, and the second is the name. The yield of the function is |
| 2501 |
the subpattern number, or PCRE_ERROR_NOSUBSTRING (-7) if there is no |
the subpattern number, or PCRE_ERROR_NOSUBSTRING (-7) if there is no |
| 2502 |
subpattern of that name. |
subpattern of that name. |
| 2503 |
|
|
| 2504 |
Given the number, you can extract the substring directly, or use one of |
Given the number, you can extract the substring directly, or use one of |
| 2505 |
the functions described in the previous section. For convenience, there |
the functions described in the previous section. For convenience, there |
| 2506 |
are also two functions that do the whole job. |
are also two functions that do the whole job. |
| 2507 |
|
|
| 2508 |
Most of the arguments of pcre_copy_named_substring() and |
Most of the arguments of pcre_copy_named_substring() and |
| 2509 |
pcre_get_named_substring() are the same as those for the similarly |
pcre_get_named_substring() are the same as those for the similarly |
| 2510 |
named functions that extract by number. As these are described in the |
named functions that extract by number. As these are described in the |
| 2511 |
previous section, they are not re-described here. There are just two |
previous section, they are not re-described here. There are just two |
| 2512 |
differences: |
differences: |
| 2513 |
|
|
| 2514 |
First, instead of a substring number, a substring name is given. Sec- |
First, instead of a substring number, a substring name is given. Sec- |
| 2515 |
ond, there is an extra argument, given at the start, which is a pointer |
ond, there is an extra argument, given at the start, which is a pointer |
| 2516 |
to the compiled pattern. This is needed in order to gain access to the |
to the compiled pattern. This is needed in order to gain access to the |
| 2517 |
name-to-number translation table. |
name-to-number translation table. |
| 2518 |
|
|
| 2519 |
These functions call pcre_get_stringnumber(), and if it succeeds, they |
These functions call pcre_get_stringnumber(), and if it succeeds, they |
| 2520 |
then call pcre_copy_substring() or pcre_get_substring(), as appropri- |
then call pcre_copy_substring() or pcre_get_substring(), as appropri- |
| 2521 |
ate. NOTE: If PCRE_DUPNAMES is set and there are duplicate names, the |
ate. NOTE: If PCRE_DUPNAMES is set and there are duplicate names, the |
| 2522 |
behaviour may not be what you want (see the next section). |
behaviour may not be what you want (see the next section). |
| 2523 |
|
|
| 2524 |
Warning: If the pattern uses the (?| feature to set up multiple subpat- |
Warning: If the pattern uses the (?| feature to set up multiple subpat- |
| 2525 |
terns with the same number, as described in the section on duplicate |
terns with the same number, as described in the section on duplicate |
| 2526 |
subpattern numbers in the pcrepattern page, you cannot use names to |
subpattern numbers in the pcrepattern page, you cannot use names to |
| 2527 |
distinguish the different subpatterns, because names are not included |
distinguish the different subpatterns, because names are not included |
| 2528 |
in the compiled code. The matching process uses only numbers. For this |
in the compiled code. The matching process uses only numbers. For this |
| 2529 |
reason, the use of different names for subpatterns of the same number |
reason, the use of different names for subpatterns of the same number |
| 2530 |
causes an error at compile time. |
causes an error at compile time. |
| 2531 |
|
|
| 2532 |
|
|
| 2535 |
int pcre_get_stringtable_entries(const pcre *code, |
int pcre_get_stringtable_entries(const pcre *code, |
| 2536 |
const char *name, char **first, char **last); |
const char *name, char **first, char **last); |
| 2537 |
|
|
| 2538 |
When a pattern is compiled with the PCRE_DUPNAMES option, names for |
When a pattern is compiled with the PCRE_DUPNAMES option, names for |
| 2539 |
subpatterns are not required to be unique. (Duplicate names are always |
subpatterns are not required to be unique. (Duplicate names are always |
| 2540 |
allowed for subpatterns with the same number, created by using the (?| |
allowed for subpatterns with the same number, created by using the (?| |
| 2541 |
feature. Indeed, if such subpatterns are named, they are required to |
feature. Indeed, if such subpatterns are named, they are required to |
| 2542 |
use the same names.) |
use the same names.) |
| 2543 |
|
|
| 2544 |
Normally, patterns with duplicate names are such that in any one match, |
Normally, patterns with duplicate names are such that in any one match, |
| 2545 |
only one of the named subpatterns participates. An example is shown in |
only one of the named subpatterns participates. An example is shown in |
| 2546 |
the pcrepattern documentation. |
the pcrepattern documentation. |
| 2547 |
|
|
| 2548 |
When duplicates are present, pcre_copy_named_substring() and |
When duplicates are present, pcre_copy_named_substring() and |
| 2549 |
pcre_get_named_substring() return the first substring corresponding to |
pcre_get_named_substring() return the first substring corresponding to |
| 2550 |
the given name that is set. If none are set, PCRE_ERROR_NOSUBSTRING |
the given name that is set. If none are set, PCRE_ERROR_NOSUBSTRING |
| 2551 |
(-7) is returned; no data is returned. The pcre_get_stringnumber() |
(-7) is returned; no data is returned. The pcre_get_stringnumber() |
| 2552 |
function returns one of the numbers that are associated with the name, |
function returns one of the numbers that are associated with the name, |
| 2553 |
but it is not defined which it is. |
but it is not defined which it is. |
| 2554 |
|
|
| 2555 |
If you want to get full details of all captured substrings for a given |
If you want to get full details of all captured substrings for a given |
| 2556 |
name, you must use the pcre_get_stringtable_entries() function. The |
name, you must use the pcre_get_stringtable_entries() function. The |
| 2557 |
first argument is the compiled pattern, and the second is the name. The |
first argument is the compiled pattern, and the second is the name. The |
| 2558 |
third and fourth are pointers to variables which are updated by the |
third and fourth are pointers to variables which are updated by the |
| 2559 |
function. After it has run, they point to the first and last entries in |
function. After it has run, they point to the first and last entries in |
| 2560 |
the name-to-number table for the given name. The function itself |
the name-to-number table for the given name. The function itself |
| 2561 |
returns the length of each entry, or PCRE_ERROR_NOSUBSTRING (-7) if |
returns the length of each entry, or PCRE_ERROR_NOSUBSTRING (-7) if |
| 2562 |
there are none. The format of the table is described above in the sec- |
there are none. The format of the table is described above in the sec- |
| 2563 |
tion entitled Information about a pattern. Given all the relevant |
tion entitled Information about a pattern. Given all the relevant |
| 2564 |
entries for the name, you can extract each of their numbers, and hence |
entries for the name, you can extract each of their numbers, and hence |
| 2565 |
the captured data, if any. |
the captured data, if any. |
| 2566 |
|
|
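The last step, extracting the numbers from the entries between first and last, might look like the sketch below. It assumes the 8-bit name table layout given in the section entitled Information about a pattern: each entry starts with the subpattern number in two bytes, most significant first, followed by the zero-terminated name. The function name sketch_collect_group_numbers is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Collect the subpattern numbers from the name-table entries
   first..last inclusive. entry_size is the value returned by
   pcre_get_stringtable_entries(). Writes at most `max` numbers to
   `numbers` and returns how many were written. */
static int sketch_collect_group_numbers(const char *first, const char *last,
                                        int entry_size, int *numbers, int max)
{
    const unsigned char *entry;
    int count = 0;

    assert(first != NULL && last != NULL && numbers != NULL);
    assert(entry_size > 2);

    for (entry = (const unsigned char *)first;
         entry <= (const unsigned char *)last && count < max;
         entry += entry_size)
    {
        /* Two-byte number, most significant byte first. */
        numbers[count++] = (entry[0] << 8) | entry[1];
    }
    return count;
}
```

Each number obtained this way can then be used to index ovector, or be passed to one of the functions that extract by number, to find out which of the duplicates actually captured something.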
| 2567 |
|
|
| 2568 |
FINDING ALL POSSIBLE MATCHES |
FINDING ALL POSSIBLE MATCHES |
| 2569 |
|
|
| 2570 |
The traditional matching function uses a similar algorithm to Perl, |
The traditional matching function uses a similar algorithm to Perl, |
| 2571 |
which stops when it finds the first match, starting at a given point in |
which stops when it finds the first match, starting at a given point in |
| 2572 |
the subject. If you want to find all possible matches, or the longest |
the subject. If you want to find all possible matches, or the longest |
| 2573 |
possible match, consider using the alternative matching function (see |
possible match, consider using the alternative matching function (see |
| 2574 |
below) instead. If you cannot use the alternative function, but still |
below) instead. If you cannot use the alternative function, but still |
| 2575 |
need to find all possible matches, you can kludge it up by making use |
need to find all possible matches, you can kludge it up by making use |
| 2576 |
of the callout facility, which is described in the pcrecallout documen- |
of the callout facility, which is described in the pcrecallout documen- |
| 2577 |
tation. |
tation. |
| 2578 |
|
|
| 2579 |
What you have to do is to insert a callout right at the end of the pat- |
What you have to do is to insert a callout right at the end of the pat- |
| 2580 |
tern. When your callout function is called, extract and save the cur- |
tern. When your callout function is called, extract and save the cur- |
| 2581 |
rent matched substring. Then return 1, which forces pcre_exec() to |
rent matched substring. Then return 1, which forces pcre_exec() to |
| 2582 |
backtrack and try other alternatives. Ultimately, when it runs out of |
backtrack and try other alternatives. Ultimately, when it runs out of |
| 2583 |
matches, pcre_exec() will yield PCRE_ERROR_NOMATCH. |
matches, pcre_exec() will yield PCRE_ERROR_NOMATCH. |
| 2584 |
|
|
| 2585 |
|
|
| 2590 |
int options, int *ovector, int ovecsize, |
int options, int *ovector, int ovecsize, |
| 2591 |
int *workspace, int wscount); |
int *workspace, int wscount); |
| 2592 |
|
|
| 2593 |
The function pcre_dfa_exec() is called to match a subject string |
The function pcre_dfa_exec() is called to match a subject string |
| 2594 |
against a compiled pattern, using a matching algorithm that scans the |
against a compiled pattern, using a matching algorithm that scans the |
| 2595 |
subject string just once, and does not backtrack. This has different |
subject string just once, and does not backtrack. This has different |
| 2596 |
characteristics to the normal algorithm, and is not compatible with |
characteristics to the normal algorithm, and is not compatible with |
| 2597 |
Perl. Some of the features of PCRE patterns are not supported. Never- |
Perl. Some of the features of PCRE patterns are not supported. Never- |
| 2598 |
theless, there are times when this kind of matching can be useful. For |
theless, there are times when this kind of matching can be useful. For |
| 2599 |
a discussion of the two matching algorithms, and a list of features |
a discussion of the two matching algorithms, and a list of features |
| 2600 |
that pcre_dfa_exec() does not support, see the pcrematching documenta- |
that pcre_dfa_exec() does not support, see the pcrematching documenta- |
| 2601 |
tion. |
tion. |
| 2602 |
|
|
| 2603 |
The arguments for the pcre_dfa_exec() function are the same as for |
The arguments for the pcre_dfa_exec() function are the same as for |
| 2604 |
pcre_exec(), plus two extras. The ovector argument is used in a differ- |
pcre_exec(), plus two extras. The ovector argument is used in a differ- |
| 2605 |
ent way, and this is described below. The other common arguments are |
ent way, and this is described below. The other common arguments are |
| 2606 |
used in the same way as for pcre_exec(), so their description is not |
used in the same way as for pcre_exec(), so their description is not |
| 2607 |
repeated here. |
repeated here. |
| 2608 |
|
|
| 2609 |
The two additional arguments provide workspace for the function. The |
The two additional arguments provide workspace for the function. The |
| 2610 |
workspace vector should contain at least 20 elements. It is used for |
workspace vector should contain at least 20 elements. It is used for |
| 2611 |
keeping track of multiple paths through the pattern tree. More |
keeping track of multiple paths through the pattern tree. More |
| 2612 |
workspace will be needed for patterns and subjects where there are a |
workspace will be needed for patterns and subjects where there are a |
| 2613 |
lot of potential matches. |
lot of potential matches. |
| 2614 |
|
|
| 2615 |
Here is an example of a simple call to pcre_dfa_exec(): |
Here is an example of a simple call to pcre_dfa_exec(): |
| 2631 |
|
|
| 2632 |
Option bits for pcre_dfa_exec() |
Option bits for pcre_dfa_exec() |
| 2633 |
|
|
| 2634 |
The unused bits of the options argument for pcre_dfa_exec() must be |
The unused bits of the options argument for pcre_dfa_exec() must be |
| 2635 |
zero. The only bits that may be set are PCRE_ANCHORED, PCRE_NEW- |
zero. The only bits that may be set are PCRE_ANCHORED, PCRE_NEW- |
| 2636 |
LINE_xxx, PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, |
LINE_xxx, PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, |
| 2637 |
PCRE_NOTEMPTY_ATSTART, PCRE_NO_UTF8_CHECK, PCRE_PARTIAL_HARD, PCRE_PAR- |
PCRE_NOTEMPTY_ATSTART, PCRE_NO_UTF8_CHECK, PCRE_BSR_ANYCRLF, |
| 2638 |
TIAL_SOFT, PCRE_DFA_SHORTEST, and PCRE_DFA_RESTART. All but the last |
PCRE_BSR_UNICODE, PCRE_NO_START_OPTIMIZE, PCRE_PARTIAL_HARD, PCRE_PAR- |
| 2639 |
|
TIAL_SOFT, PCRE_DFA_SHORTEST, and PCRE_DFA_RESTART. All but the last |
| 2640 |
four of these are exactly the same as for pcre_exec(), so their |
four of these are exactly the same as for pcre_exec(), so their |
| 2641 |
description is not repeated here. |
description is not repeated here. |
| 2642 |
|
|
| 2759 |
|
|
| 2760 |
REVISION |
REVISION |
| 2761 |
|
|
| 2762 |
Last updated: 01 June 2010 |
Last updated: 15 June 2010 |
| 2763 |
Copyright (c) 1997-2010 University of Cambridge. |
Copyright (c) 1997-2010 University of Cambridge. |
| 2764 |
------------------------------------------------------------------------------ |
------------------------------------------------------------------------------ |
| 2765 |
|
|