| 140 |
.\" HREF |
.\" HREF |
| 141 |
\fBpcresample\fP |
\fBpcresample\fP |
| 142 |
.\" |
.\" |
| 143 |
documentation describes how to run it. |
documentation describes how to compile and run it. |
| 144 |
.P |
.P |
| 145 |
A second matching function, \fBpcre_dfa_exec()\fP, which is not |
A second matching function, \fBpcre_dfa_exec()\fP, which is not |
| 146 |
Perl-compatible, is also provided. This uses a different algorithm for the |
Perl-compatible, is also provided. This uses a different algorithm for the |
| 254 |
.\" </a> |
.\" </a> |
| 255 |
section on \fBpcre_exec()\fP options |
section on \fBpcre_exec()\fP options |
| 256 |
.\" |
.\" |
| 257 |
below. The choice of newline convention does not affect the interpretation of |
below. |
| 258 |
the \en or \er escape sequences. |
.P |
| 259 |
|
The choice of newline convention does not affect the interpretation of |
| 260 |
|
the \en or \er escape sequences, nor does it affect what \eR matches, which is |
| 261 |
|
controlled in a similar way, but by separate options. |
| 262 |
. |
. |
| 263 |
. |
. |
| 264 |
.SH MULTITHREADING |
.SH MULTITHREADING |
| 317 |
.sp |
.sp |
| 318 |
The output is an integer whose value specifies the default character sequence |
The output is an integer whose value specifies the default character sequence |
| 319 |
that is recognized as meaning "newline". The five values that are supported |
that is recognized as meaning "newline". The five values that are supported |
| 320 |
are: 10 for LF, 13 for CR, 3338 for CRLF, -2 for ANYCRLF, and -1 for ANY. The |
are: 10 for LF, 13 for CR, 3338 for CRLF, -2 for ANYCRLF, and -1 for ANY. |
| 321 |
default should normally be the standard sequence for your operating system. |
Though they are derived from ASCII, the same values are returned in EBCDIC |
| 322 |
|
environments. The default should normally correspond to the standard sequence |
| 323 |
|
for your operating system. |
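As a sketch of how a caller might interpret the integer that is returned for
PCRE_CONFIG_NEWLINE, the helper below maps the documented values to names. The
function name newline_name is hypothetical and is not part of PCRE. It relies
only on the values listed above, plus the observation that 3338 is
(13 << 8) | 10, that is, CR and LF packed into one integer.

```c
#include <stddef.h>

/* Hypothetical helper, not part of the PCRE API: map the value that
   pcre_config(PCRE_CONFIG_NEWLINE, &value) stores to a readable name. */
const char *newline_name(int value)
{
    switch (value) {
    case 10:             return "LF";
    case 13:             return "CR";
    case (13 << 8) | 10: return "CRLF";    /* 3338: CR in the high byte, LF in the low */
    case -2:             return "ANYCRLF";
    case -1:             return "ANY";
    default:             return NULL;      /* not one of the documented values */
    }
}
```

A caller would typically obtain the value from pcre_config() first and treat a
NULL result from this helper as "unknown" rather than as an error.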
| 324 |
|
.sp |
| 325 |
|
PCRE_CONFIG_BSR |
| 326 |
|
.sp |
| 327 |
|
The output is an integer whose value indicates what character sequences the \eR |
| 328 |
|
escape sequence matches by default. A value of 0 means that \eR matches any |
| 329 |
|
Unicode line ending sequence; a value of 1 means that \eR matches only CR, LF, |
| 330 |
|
or CRLF. The default can be overridden when a pattern is compiled or matched. |
| 331 |
.sp |
.sp |
| 332 |
PCRE_CONFIG_LINK_SIZE |
PCRE_CONFIG_LINK_SIZE |
| 333 |
.sp |
.sp |
| 349 |
.sp |
.sp |
| 350 |
PCRE_CONFIG_MATCH_LIMIT |
PCRE_CONFIG_MATCH_LIMIT |
| 351 |
.sp |
.sp |
| 352 |
The output is an integer that gives the default limit for the number of |
The output is a long integer that gives the default limit for the number of |
| 353 |
internal matching function calls in a \fBpcre_exec()\fP execution. Further |
internal matching function calls in a \fBpcre_exec()\fP execution. Further |
| 354 |
details are given with \fBpcre_exec()\fP below. |
details are given with \fBpcre_exec()\fP below. |
| 355 |
.sp |
.sp |
| 356 |
PCRE_CONFIG_MATCH_LIMIT_RECURSION |
PCRE_CONFIG_MATCH_LIMIT_RECURSION |
| 357 |
.sp |
.sp |
| 358 |
The output is an integer that gives the default limit for the depth of |
The output is a long integer that gives the default limit for the depth of |
| 359 |
recursion when calling the internal matching function in a \fBpcre_exec()\fP |
recursion when calling the internal matching function in a \fBpcre_exec()\fP |
| 360 |
execution. Further details are given with \fBpcre_exec()\fP below. |
execution. Further details are given with \fBpcre_exec()\fP below. |
| 361 |
.sp |
.sp |
| 470 |
.\" |
.\" |
| 471 |
documentation. |
documentation. |
| 472 |
.sp |
.sp |
| 473 |
|
PCRE_BSR_ANYCRLF |
| 474 |
|
PCRE_BSR_UNICODE |
| 475 |
|
.sp |
| 476 |
|
These options (which are mutually exclusive) control what the \eR escape |
| 477 |
|
sequence matches. The choice is either to match only CR, LF, or CRLF, or to |
| 478 |
|
match any Unicode newline sequence. The default is specified when PCRE is |
| 479 |
|
built. It can be overridden from within the pattern, or by setting an option |
| 480 |
|
when a compiled pattern is matched. |
| 481 |
|
.sp |
| 482 |
PCRE_CASELESS |
PCRE_CASELESS |
| 483 |
.sp |
.sp |
| 484 |
If this bit is set, letters in the pattern match both upper and lower case |
If this bit is set, letters in the pattern match both upper and lower case |
| 551 |
the first newline in the subject string, though the matched text may continue |
the first newline in the subject string, though the matched text may continue |
| 552 |
over the newline. |
over the newline. |
| 553 |
.sp |
.sp |
| 554 |
|
PCRE_JAVASCRIPT_COMPAT |
| 555 |
|
.sp |
| 556 |
|
If this option is set, PCRE's behaviour is changed in some ways so that it is |
| 557 |
|
compatible with JavaScript rather than Perl. The changes are as follows: |
| 558 |
|
.P |
| 559 |
|
(1) A lone closing square bracket in a pattern causes a compile-time error, |
| 560 |
|
because this is illegal in JavaScript (by default it is treated as a data |
| 561 |
|
character). Thus, the pattern AB]CD becomes illegal when this option is set. |
| 562 |
|
.P |
| 563 |
|
(2) At run time, a back reference to an unset subpattern group matches an empty |
| 564 |
|
string (by default this causes the current matching alternative to fail). A |
| 565 |
|
pattern such as (\e1)(a) succeeds when this option is set (assuming it can find |
| 566 |
|
an "a" in the subject), whereas it fails by default, for Perl compatibility. |
| 567 |
|
.sp |
| 568 |
PCRE_MULTILINE |
PCRE_MULTILINE |
| 569 |
.sp |
.sp |
| 570 |
By default, PCRE treats the subject string as consisting of a single line of |
By default, PCRE treats the subject string as consisting of a single line of |
| 688 |
9 nothing to repeat |
9 nothing to repeat |
| 689 |
10 [this code is not in use] |
10 [this code is not in use] |
| 690 |
11 internal error: unexpected repeat |
11 internal error: unexpected repeat |
| 691 |
12 unrecognized character after (? |
12 unrecognized character after (? or (?- |
| 692 |
13 POSIX named classes are supported only within a class |
13 POSIX named classes are supported only within a class |
| 693 |
14 missing ) |
14 missing ) |
| 694 |
15 reference to non-existent subpattern |
15 reference to non-existent subpattern |
| 696 |
17 unknown option bit(s) set |
17 unknown option bit(s) set |
| 697 |
18 missing ) after comment |
18 missing ) after comment |
| 698 |
19 [this code is not in use] |
19 [this code is not in use] |
| 699 |
20 regular expression too large |
20 regular expression is too large |
| 700 |
21 failed to get memory |
21 failed to get memory |
| 701 |
22 unmatched parentheses |
22 unmatched parentheses |
| 702 |
23 internal error: code overflow |
23 internal error: code overflow |
| 725 |
46 malformed \eP or \ep sequence |
46 malformed \eP or \ep sequence |
| 726 |
47 unknown property name after \eP or \ep |
47 unknown property name after \eP or \ep |
| 727 |
48 subpattern name is too long (maximum 32 characters) |
48 subpattern name is too long (maximum 32 characters) |
| 728 |
49 too many named subpatterns (maximum 10,000) |
49 too many named subpatterns (maximum 10000) |
| 729 |
50 [this code is not in use] |
50 [this code is not in use] |
| 730 |
51 octal value is greater than \e377 (not in UTF-8 mode) |
51 octal value is greater than \e377 (not in UTF-8 mode) |
| 731 |
52 internal error: overran compiling workspace |
52 internal error: overran compiling workspace |
| 732 |
53 internal error: previously-checked referenced subpattern not found |
53 internal error: previously-checked referenced subpattern not found |
| 733 |
54 DEFINE group contains more than one branch |
54 DEFINE group contains more than one branch |
| 734 |
55 repeating a DEFINE group is not allowed |
55 repeating a DEFINE group is not allowed |
| 735 |
56 inconsistent NEWLINE options" |
56 inconsistent NEWLINE options |
| 736 |
57 \eg is not followed by a braced name or an optionally braced |
57 \eg is not followed by a braced, angle-bracketed, or quoted |
| 737 |
non-zero number |
name/number or by a plain number |
| 738 |
58 (?+ or (?- or (?(+ or (?(- must be followed by a non-zero number |
58 a numbered reference must not be zero |
| 739 |
|
59 (*VERB) with an argument is not supported |
| 740 |
|
60 (*VERB) not recognized |
| 741 |
|
61 number is too big |
| 742 |
|
62 subpattern name expected |
| 743 |
|
63 digit expected after (?+ |
| 744 |
|
64 ] is an invalid data character in JavaScript compatibility mode |
| 745 |
|
.sp |
| 746 |
|
The numbers 32 and 10000 in errors 48 and 49 are defaults; different values may |
| 747 |
|
be used if the limits were changed when PCRE was built. |
| 748 |
. |
. |
| 749 |
. |
. |
| 750 |
.SH "STUDYING A PATTERN" |
.SH "STUDYING A PATTERN" |
| 943 |
PCRE_INFO_HASCRORLF |
PCRE_INFO_HASCRORLF |
| 944 |
.sp |
.sp |
| 945 |
Return 1 if the pattern contains any explicit matches for CR or LF characters, |
Return 1 if the pattern contains any explicit matches for CR or LF characters, |
| 946 |
otherwise 0. The fourth argument should point to an \fBint\fP variable. |
otherwise 0. The fourth argument should point to an \fBint\fP variable. An |
| 947 |
|
explicit match is either a literal CR or LF character, or \er or \en. |
| 948 |
.sp |
.sp |
| 949 |
PCRE_INFO_JCHANGED |
PCRE_INFO_JCHANGED |
| 950 |
.sp |
.sp |
| 951 |
Return 1 if the (?J) option setting is used in the pattern, otherwise 0. The |
Return 1 if the (?J) or (?-J) option setting is used in the pattern, otherwise |
| 952 |
fourth argument should point to an \fBint\fP variable. The (?J) internal option |
0. The fourth argument should point to an \fBint\fP variable. (?J) and |
| 953 |
setting changes the local PCRE_DUPNAMES option. |
(?-J) set and unset the local PCRE_DUPNAMES option, respectively. |
| 954 |
.sp |
.sp |
| 955 |
PCRE_INFO_LASTLITERAL |
PCRE_INFO_LASTLITERAL |
| 956 |
.sp |
.sp |
| 1239 |
.sp |
.sp |
| 1240 |
The unused bits of the \fIoptions\fP argument for \fBpcre_exec()\fP must be |
The unused bits of the \fIoptions\fP argument for \fBpcre_exec()\fP must be |
| 1241 |
zero. The only bits that may be set are PCRE_ANCHORED, PCRE_NEWLINE_\fIxxx\fP, |
zero. The only bits that may be set are PCRE_ANCHORED, PCRE_NEWLINE_\fIxxx\fP, |
| 1242 |
PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, PCRE_NO_UTF8_CHECK and PCRE_PARTIAL. |
PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, PCRE_NO_START_OPTIMIZE, |
| 1243 |
|
PCRE_NO_UTF8_CHECK and PCRE_PARTIAL. |
| 1244 |
.sp |
.sp |
| 1245 |
PCRE_ANCHORED |
PCRE_ANCHORED |
| 1246 |
.sp |
.sp |
| 1249 |
to be anchored by virtue of its contents, it cannot be made unanchored at |
to be anchored by virtue of its contents, it cannot be made unanchored at |
| 1250 |
matching time. |
matching time. |
| 1251 |
.sp |
.sp |
| 1252 |
|
PCRE_BSR_ANYCRLF |
| 1253 |
|
PCRE_BSR_UNICODE |
| 1254 |
|
.sp |
| 1255 |
|
These options (which are mutually exclusive) control what the \eR escape |
| 1256 |
|
sequence matches. The choice is either to match only CR, LF, or CRLF, or to |
| 1257 |
|
match any Unicode newline sequence. These options override the choice that was |
| 1258 |
|
made or defaulted when the pattern was compiled. |
| 1259 |
|
.sp |
| 1260 |
PCRE_NEWLINE_CR |
PCRE_NEWLINE_CR |
| 1261 |
PCRE_NEWLINE_LF |
PCRE_NEWLINE_LF |
| 1262 |
PCRE_NEWLINE_CRLF |
PCRE_NEWLINE_CRLF |
| 1283 |
[\er\en]A does match that string, because it contains an explicit CR or LF |
[\er\en]A does match that string, because it contains an explicit CR or LF |
| 1284 |
reference, and so advances only by one character after the first failure. |
reference, and so advances only by one character after the first failure. |
| 1285 |
.P |
.P |
| 1286 |
An explicit match for CR or LF is either a literal appearance of one of those |
An explicit match for CR or LF is either a literal appearance of one of those |
| 1287 |
characters, or one of the \er or \en escape sequences. Implicit matches such as |
characters, or one of the \er or \en escape sequences. Implicit matches such as |
| 1288 |
[^X] do not count, nor does \es (which includes CR and LF in the characters |
[^X] do not count, nor does \es (which includes CR and LF in the characters |
| 1289 |
that it matches). |
that it matches). |
| 1290 |
.P |
.P |
| 1327 |
starting offset (see below) and trying an ordinary match again. There is some |
starting offset (see below) and trying an ordinary match again. There is some |
| 1328 |
code that demonstrates how to do this in the \fIpcredemo.c\fP sample program. |
code that demonstrates how to do this in the \fIpcredemo.c\fP sample program. |
| 1329 |
.sp |
.sp |
| 1330 |
|
PCRE_NO_START_OPTIMIZE |
| 1331 |
|
.sp |
| 1332 |
|
There are a number of optimizations that \fBpcre_exec()\fP uses at the start of |
| 1333 |
|
a match, in order to speed up the process. For example, if it is known that a |
| 1334 |
|
match must start with a specific character, it searches the subject for that |
| 1335 |
|
character, and fails immediately if it cannot find it, without actually running |
| 1336 |
|
the main matching function. When callouts are in use, these optimizations can |
| 1337 |
|
cause them to be skipped. This option disables the "start-up" optimizations, |
| 1338 |
|
causing performance to suffer, but ensuring that the callouts do occur. |
| 1339 |
|
.sp |
| 1340 |
PCRE_NO_UTF8_CHECK |
PCRE_NO_UTF8_CHECK |
| 1341 |
.sp |
.sp |
| 1342 |
When PCRE_UTF8 is set at compile time, the validity of the subject as a UTF-8 |
When PCRE_UTF8 is set at compile time, the validity of the subject as a UTF-8 |
| 1384 |
.rs |
.rs |
| 1385 |
.sp |
.sp |
| 1386 |
The subject string is passed to \fBpcre_exec()\fP as a pointer in |
The subject string is passed to \fBpcre_exec()\fP as a pointer in |
| 1387 |
\fIsubject\fP, a length in \fIlength\fP, and a starting byte offset in |
\fIsubject\fP, a length (in bytes) in \fIlength\fP, and a starting byte offset |
| 1388 |
\fIstartoffset\fP. In UTF-8 mode, the byte offset must point to the start of a |
in \fIstartoffset\fP. In UTF-8 mode, the byte offset must point to the start of |
| 1389 |
UTF-8 character. Unlike the pattern string, the subject may contain binary zero |
a UTF-8 character. Unlike the pattern string, the subject may contain binary |
| 1390 |
bytes. When the starting offset is zero, the search for a match starts at the |
zero bytes. When the starting offset is zero, the search for a match starts at |
| 1391 |
beginning of the subject, and this is by far the most common case. |
the beginning of the subject, and this is by far the most common case. |
| 1392 |
.P |
.P |
| 1393 |
A non-zero starting offset is useful when searching for another match in the |
A non-zero starting offset is useful when searching for another match in the |
| 1394 |
same subject by calling \fBpcre_exec()\fP again after a previous success. |
same subject by calling \fBpcre_exec()\fP again after a previous success. |
| 1422 |
a fragment of a pattern that picks out a substring. PCRE supports several other |
a fragment of a pattern that picks out a substring. PCRE supports several other |
| 1423 |
kinds of parenthesized subpattern that do not cause substrings to be captured. |
kinds of parenthesized subpattern that do not cause substrings to be captured. |
| 1424 |
.P |
.P |
| 1425 |
Captured substrings are returned to the caller via a vector of integer offsets |
Captured substrings are returned to the caller via a vector of integers whose |
| 1426 |
whose address is passed in \fIovector\fP. The number of elements in the vector |
address is passed in \fIovector\fP. The number of elements in the vector is |
| 1427 |
is passed in \fIovecsize\fP, which must be a non-negative number. \fBNote\fP: |
passed in \fIovecsize\fP, which must be a non-negative number. \fBNote\fP: this |
| 1428 |
this argument is NOT the size of \fIovector\fP in bytes. |
argument is NOT the size of \fIovector\fP in bytes. |
| 1429 |
.P |
.P |
| 1430 |
The first two-thirds of the vector is used to pass back captured substrings, |
The first two-thirds of the vector is used to pass back captured substrings, |
| 1431 |
each substring using a pair of integers. The remaining third of the vector is |
each substring using a pair of integers. The remaining third of the vector is |
| 1432 |
used as workspace by \fBpcre_exec()\fP while matching capturing subpatterns, |
used as workspace by \fBpcre_exec()\fP while matching capturing subpatterns, |
| 1433 |
and is not available for passing back information. The length passed in |
and is not available for passing back information. The number passed in |
| 1434 |
\fIovecsize\fP should always be a multiple of three. If it is not, it is |
\fIovecsize\fP should always be a multiple of three. If it is not, it is |
| 1435 |
rounded down. |
rounded down. |
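The sizing rule above can be expressed as simple arithmetic. The following is a
minimal sketch; the function names ovector_elements and usable_pairs are
hypothetical and do not exist in PCRE. It assumes that the caller wants room
for the whole match plus a known number of capturing subpatterns.

```c
/* Hypothetical helper: number of int elements needed for an ovector that can
   return the whole match plus capture_count capturing subpatterns. */
int ovector_elements(int capture_count)
{
    int pairs = capture_count + 1;   /* pair 0 describes the whole match */
    return pairs * 3;                /* two ints per pair, plus one third as workspace */
}

/* Hypothetical helper: number of offset pairs that can be returned for a
   given ovecsize. Integer division discards the remainder, which mirrors
   the rounding down to a multiple of three. */
int usable_pairs(int ovecsize)
{
    return ovecsize / 3;
}
```

For example, a pattern with two capturing subpatterns needs nine elements, and
an ovecsize of 30, 31, or 32 all leave room for ten pairs.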
| 1436 |
.P |
.P |
| 1437 |
When a match is successful, information about captured substrings is returned |
When a match is successful, information about captured substrings is returned |
| 1438 |
in pairs of integers, starting at the beginning of \fIovector\fP, and |
in pairs of integers, starting at the beginning of \fIovector\fP, and |
| 1439 |
continuing up to two-thirds of its length at the most. The first element of a |
continuing up to two-thirds of its length at the most. The first element of |
| 1440 |
pair is set to the offset of the first character in a substring, and the second |
each pair is set to the byte offset of the first character in a substring, and |
| 1441 |
is set to the offset of the first character after the end of a substring. The |
the second is set to the byte offset of the first character after the end of a |
| 1442 |
first pair, \fIovector[0]\fP and \fIovector[1]\fP, identify the portion of the |
substring. \fBNote\fP: these values are always byte offsets, even in UTF-8 |
| 1443 |
subject string matched by the entire pattern. The next pair is used for the |
mode. They are not character counts. |
| 1444 |
first capturing subpattern, and so on. The value returned by \fBpcre_exec()\fP |
.P |
| 1445 |
is one more than the highest numbered pair that has been set. For example, if |
The first pair of integers, \fIovector[0]\fP and \fIovector[1]\fP, identify the |
| 1446 |
two substrings have been captured, the returned value is 3. If there are no |
portion of the subject string matched by the entire pattern. The next pair is |
| 1447 |
capturing subpatterns, the return value from a successful match is 1, |
used for the first capturing subpattern, and so on. The value returned by |
| 1448 |
indicating that just the first pair of offsets has been set. |
\fBpcre_exec()\fP is one more than the highest numbered pair that has been set. |
| 1449 |
|
For example, if two substrings have been captured, the returned value is 3. If |
| 1450 |
|
there are no capturing subpatterns, the return value from a successful match is |
| 1451 |
|
1, indicating that just the first pair of offsets has been set. |
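To illustrate the pair layout described above, here is a minimal sketch of
extracting one captured substring by hand. The function copy_capture is a
hypothetical helper written for this example; the library's own
pcre_copy_substring() serves the same purpose and is normally preferable. The
sketch assumes that rc is a positive value returned by a successful
pcre_exec() call and that offsets are byte offsets into the subject.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy capture number n (0 = whole match) from subject
   into buf, using the offset pairs left in ovector by a successful match.
   rc is the value the match call returned. Returns the substring length in
   bytes, or -1 if pair n was not set or buf is too small. */
int copy_capture(const char *subject, const int *ovector, int rc,
                 int n, char *buf, size_t bufsize)
{
    int start, end;
    size_t len;

    if (n < 0 || n >= rc)
        return -1;                  /* pair n is beyond the highest pair set */

    start = ovector[2 * n];         /* byte offset of the first character */
    end   = ovector[2 * n + 1];     /* byte offset just past the last character */
    if (start < 0 || end < start)
        return -1;                  /* group did not take part in the match */

    len = (size_t)(end - start);
    if (len + 1 > bufsize)
        return -1;                  /* no room for the bytes plus a terminator */

    memcpy(buf, subject + start, len);
    buf[len] = 0;                   /* terminate the copy */
    return (int)len;
}
```

Because the offsets count bytes, the returned length is also a byte count, even
when the subject is a UTF-8 string.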
| 1452 |
.P |
.P |
| 1453 |
If a capturing subpattern is matched repeatedly, it is the last portion of the |
If a capturing subpattern is matched repeatedly, it is the last portion of the |
| 1454 |
string that it matched that is returned. |
string that it matched that is returned. |
| 1455 |
.P |
.P |
| 1456 |
If the vector is too small to hold all the captured substring offsets, it is |
If the vector is too small to hold all the captured substring offsets, it is |
| 1457 |
used as far as possible (up to two-thirds of its length), and the function |
used as far as possible (up to two-thirds of its length), and the function |
| 1458 |
returns a value of zero. In particular, if the substring offsets are not of |
returns a value of zero. If the substring offsets are not of interest, |
| 1459 |
interest, \fBpcre_exec()\fP may be called with \fIovector\fP passed as NULL and |
\fBpcre_exec()\fP may be called with \fIovector\fP passed as NULL and |
| 1460 |
\fIovecsize\fP as zero. However, if the pattern contains back references and |
\fIovecsize\fP as zero. However, if the pattern contains back references and |
| 1461 |
the \fIovector\fP is not big enough to remember the related substrings, PCRE |
the \fIovector\fP is not big enough to remember the related substrings, PCRE |
| 1462 |
has to get additional memory for use during matching. Thus it is usually |
has to get additional memory for use during matching. Thus it is usually |
| 1742 |
then call \fBpcre_copy_substring()\fP or \fBpcre_get_substring()\fP, as |
then call \fBpcre_copy_substring()\fP or \fBpcre_get_substring()\fP, as |
| 1743 |
appropriate. \fBNOTE:\fP If PCRE_DUPNAMES is set and there are duplicate names, |
appropriate. \fBNOTE:\fP If PCRE_DUPNAMES is set and there are duplicate names, |
| 1744 |
the behaviour may not be what you want (see the next section). |
the behaviour may not be what you want (see the next section). |
| 1745 |
. |
.P |
| 1746 |
|
\fBWarning:\fP If the pattern uses the "(?|" feature to set up multiple |
| 1747 |
|
subpatterns with the same number, you cannot use names to distinguish them, |
| 1748 |
|
because names are not included in the compiled code. The matching process uses |
| 1749 |
|
only numbers. |
| 1750 |
. |
. |
| 1751 |
.SH "DUPLICATE SUBPATTERN NAMES" |
.SH "DUPLICATE SUBPATTERN NAMES" |
| 1752 |
.rs |
.rs |
| 1995 |
.rs |
.rs |
| 1996 |
.sp |
.sp |
| 1997 |
.nf |
.nf |
| 1998 |
Last updated: 10 September 2007 |
Last updated: 17 March 2009 |
| 1999 |
Copyright (c) 1997-2007 University of Cambridge. |
Copyright (c) 1997-2009 University of Cambridge. |
| 2000 |
.fi |
.fi |