| 134 |
.P |
.P |
| 135 |
In a Windows environment, if you want to statically link an application program |
In a Windows environment, if you want to statically link an application program |
| 136 |
against a non-dll \fBpcre.a\fP file, you must define PCRE_STATIC before |
against a non-dll \fBpcre.a\fP file, you must define PCRE_STATIC before |
| 137 |
including \fBpcre.h\fP, because otherwise the \fBpcre_malloc()\fP and |
including \fBpcre.h\fP or \fBpcrecpp.h\fP, because otherwise the |
| 138 |
\fBpcre_free()\fP exported functions will be declared |
\fBpcre_malloc()\fP and \fBpcre_free()\fP exported functions will be declared |
| 139 |
\fB__declspec(dllimport)\fP, with unwanted results. |
\fB__declspec(dllimport)\fP, with unwanted results. |
| 140 |
.P |
.P |
| 141 |
The functions \fBpcre_compile()\fP, \fBpcre_compile2()\fP, \fBpcre_study()\fP, |
The functions \fBpcre_compile()\fP, \fBpcre_compile2()\fP, \fBpcre_study()\fP, |
| 765 |
50 [this code is not in use] |
50 [this code is not in use] |
| 766 |
51 octal value is greater than \e377 (not in UTF-8 mode) |
51 octal value is greater than \e377 (not in UTF-8 mode) |
| 767 |
52 internal error: overran compiling workspace |
52 internal error: overran compiling workspace |
| 768 |
53 internal error: previously-checked referenced subpattern not found |
53 internal error: previously-checked referenced subpattern |
| 769 |
|
not found |
| 770 |
54 DEFINE group contains more than one branch |
54 DEFINE group contains more than one branch |
| 771 |
55 repeating a DEFINE group is not allowed |
55 repeating a DEFINE group is not allowed |
| 772 |
56 inconsistent NEWLINE options |
56 inconsistent NEWLINE options |
| 779 |
62 subpattern name expected |
62 subpattern name expected |
| 780 |
63 digit expected after (?+ |
63 digit expected after (?+ |
| 781 |
64 ] is an invalid data character in JavaScript compatibility mode |
64 ] is an invalid data character in JavaScript compatibility mode |
| 782 |
65 different names for subpatterns of the same number are not allowed |
65 different names for subpatterns of the same number are |
| 783 |
|
not allowed |
| 784 |
66 (*MARK) must have an argument |
66 (*MARK) must have an argument |
| 785 |
67 this version of PCRE is not compiled with PCRE_UCP support |
67 this version of PCRE is not compiled with PCRE_UCP support |
| 786 |
.sp |
.sp |
| 848 |
single fixed starting character. A bitmap of possible starting bytes is |
single fixed starting character. A bitmap of possible starting bytes is |
| 849 |
created. This speeds up finding a position in the subject at which to start |
created. This speeds up finding a position in the subject at which to start |
| 850 |
matching. |
matching. |
| 851 |
|
.P |
| 852 |
|
The two optimizations just described can be disabled by setting the |
| 853 |
|
PCRE_NO_START_OPTIMIZE option when calling \fBpcre_exec()\fP or |
| 854 |
|
\fBpcre_dfa_exec()\fP. You might want to do this if your pattern contains |
| 855 |
|
callouts, or makes use of (*MARK), and you make use of these in cases where |
| 856 |
|
matching fails. See the discussion of PCRE_NO_START_OPTIMIZE |
| 857 |
|
.\" HTML <a href="#execoptions"> |
| 858 |
|
.\" </a> |
| 859 |
|
below. |
| 860 |
|
.\" |
| 861 |
. |
. |
| 862 |
. |
. |
| 863 |
.\" HTML <a name="localesupport"></a> |
.\" HTML <a name="localesupport"></a> |
| 1450 |
.\" HREF |
.\" HREF |
| 1451 |
\fBpcredemo\fP |
\fBpcredemo\fP |
| 1452 |
.\" |
.\" |
| 1453 |
sample program. |
sample program. In the most general case, you have to check to see if the |
| 1454 |
|
newline convention recognizes CRLF as a newline, and if so, and the current |
| 1455 |
|
character is CR followed by LF, advance the starting offset by two characters |
| 1456 |
|
instead of one. |
| 1457 |
.sp |
.sp |
| 1458 |
PCRE_NO_START_OPTIMIZE |
PCRE_NO_START_OPTIMIZE |
| 1459 |
.sp |
.sp |
| 1460 |
There are a number of optimizations that \fBpcre_exec()\fP uses at the start of |
There are a number of optimizations that \fBpcre_exec()\fP uses at the start of |
| 1461 |
a match, in order to speed up the process. For example, if it is known that a |
a match, in order to speed up the process. For example, if it is known that an |
| 1462 |
match must start with a specific character, it searches the subject for that |
unanchored match must start with a specific character, it searches the subject |
| 1463 |
character, and fails immediately if it cannot find it, without actually running |
for that character, and fails immediately if it cannot find it, without |
| 1464 |
the main matching function. When callouts are in use, these optimizations can |
actually running the main matching function. This means that a special item |
| 1465 |
cause them to be skipped. This option disables the "start-up" optimizations, |
such as (*COMMIT) at the start of a pattern is not considered until after a |
| 1466 |
causing performance to suffer, but ensuring that the callouts do occur. |
suitable starting point for the match has been found. When callouts or (*MARK) |
| 1467 |
|
items are in use, these "start-up" optimizations can cause them to be skipped |
| 1468 |
|
if the pattern is never actually used. The start-up optimizations are in effect |
| 1469 |
|
a pre-scan of the subject that takes place before the pattern is run. |
| 1470 |
|
.P |
| 1471 |
|
The PCRE_NO_START_OPTIMIZE option disables the start-up optimizations, possibly |
| 1472 |
|
causing performance to suffer, but ensuring that in cases where the result is |
| 1473 |
|
"no match", the callouts do occur, and that items such as (*COMMIT) and (*MARK) |
| 1474 |
|
are considered at every possible starting position in the subject string. |
| 1475 |
|
Setting PCRE_NO_START_OPTIMIZE can change the outcome of a matching operation. |
| 1476 |
|
Consider the pattern |
| 1477 |
|
.sp |
| 1478 |
|
(*COMMIT)ABC |
| 1479 |
|
.sp |
| 1480 |
|
When this is compiled, PCRE records the fact that a match must start with the |
| 1481 |
|
character "A". Suppose the subject string is "DEFABC". The start-up |
| 1482 |
|
optimization scans along the subject, finds "A" and runs the first match |
| 1483 |
|
attempt from there. The (*COMMIT) item means that the pattern must match the |
| 1484 |
|
current starting position, which in this case, it does. However, if the same |
| 1485 |
|
match is run with PCRE_NO_START_OPTIMIZE set, the initial scan along the |
| 1486 |
|
subject string does not happen. The first match attempt is run starting from |
| 1487 |
|
"D" and when this fails, (*COMMIT) prevents any further matches being tried, so |
| 1488 |
|
the overall result is "no match". If the pattern is studied, more start-up |
| 1489 |
|
optimizations may be used. For example, a minimum length for the subject may be |
| 1490 |
|
recorded. Consider the pattern |
| 1491 |
|
.sp |
| 1492 |
|
(*MARK:A)(X|Y) |
| 1493 |
|
.sp |
| 1494 |
|
The minimum length for a match is one character. If the subject is "ABC", there |
| 1495 |
|
will be attempts to match "ABC", "BC", "C", and then finally an empty string. |
| 1496 |
|
If the pattern is studied, the final attempt does not take place, because PCRE |
| 1497 |
|
knows that the subject is too short, and so the (*MARK) is never encountered. |
| 1498 |
|
In this case, studying the pattern does not affect the overall match result, |
| 1499 |
|
which is still "no match", but it does affect the auxiliary information that is |
| 1500 |
|
returned. |
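The (*COMMIT)ABC example above can be reproduced with a small toy model. This is only a sketch: it hard-codes that one pattern and does not call PCRE. The function name toy_commit_abc and the start_optimize flag are hypothetical, standing in for a match run without and with PCRE_NO_START_OPTIMIZE.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy model (not PCRE code) of matching the pattern (*COMMIT)ABC.
 * Returns the offset of the match, or -1 for "no match".
 * start_optimize != 0 models the default behaviour; 0 models a call
 * made with PCRE_NO_START_OPTIMIZE set. */
static long toy_commit_abc(const char *subject, size_t length,
                           int start_optimize)
{
    size_t start = 0;

    assert(subject != NULL);

    if (start_optimize) {
        /* Start-up optimization: a match must begin with 'A', so pre-scan
         * for it and fail at once if it is absent. */
        const char *first = memchr(subject, 'A', length);
        if (first == NULL)
            return -1;
        start = (size_t)(first - subject);
    }

    /* (*COMMIT) is passed on the first attempt, so there is exactly one
     * attempt: either "ABC" matches at start or the whole match fails. */
    if (length - start >= 3 && memcmp(subject + start, "ABC", 3) == 0)
        return (long)start;
    return -1;
}
```

With the subject "DEFABC", the model finds a match at offset 3 when the pre-scan is enabled and reports no match when it is disabled, which is the difference in outcome described above.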
| 1501 |
.sp |
.sp |
| 1502 |
PCRE_NO_UTF8_CHECK |
PCRE_NO_UTF8_CHECK |
| 1503 |
.sp |
.sp |
| 1515 |
\fBpcre\fP |
\fBpcre\fP |
| 1516 |
.\" |
.\" |
| 1517 |
page. If an invalid UTF-8 sequence of bytes is found, \fBpcre_exec()\fP returns |
page. If an invalid UTF-8 sequence of bytes is found, \fBpcre_exec()\fP returns |
| 1518 |
the error PCRE_ERROR_BADUTF8. If \fIstartoffset\fP contains an invalid value, |
the error PCRE_ERROR_BADUTF8. If \fIstartoffset\fP contains a value that does |
| 1519 |
|
not point to the start of a UTF-8 character (or to the end of the subject), |
| 1520 |
PCRE_ERROR_BADUTF8_OFFSET is returned. |
PCRE_ERROR_BADUTF8_OFFSET is returned. |
| 1521 |
.P |
.P |
| 1522 |
If you already know that your subject is valid, and you want to skip these |
If you already know that your subject is valid, and you want to skip these |
| 1524 |
calling \fBpcre_exec()\fP. You might want to do this for the second and |
calling \fBpcre_exec()\fP. You might want to do this for the second and |
| 1525 |
subsequent calls to \fBpcre_exec()\fP if you are making repeated calls to find |
subsequent calls to \fBpcre_exec()\fP if you are making repeated calls to find |
| 1526 |
all the matches in a single subject string. However, you should be sure that |
all the matches in a single subject string. However, you should be sure that |
| 1527 |
the value of \fIstartoffset\fP points to the start of a UTF-8 character. When |
the value of \fIstartoffset\fP points to the start of a UTF-8 character (or the |
| 1528 |
PCRE_NO_UTF8_CHECK is set, the effect of passing an invalid UTF-8 string as a |
end of the subject). When PCRE_NO_UTF8_CHECK is set, the effect of passing an |
| 1529 |
subject, or a value of \fIstartoffset\fP that does not point to the start of a |
invalid UTF-8 string as a subject or an invalid value of \fIstartoffset\fP is |
| 1530 |
UTF-8 character, is undefined. Your program may crash. |
undefined. Your program may crash. |
| 1531 |
.sp |
.sp |
| 1532 |
PCRE_PARTIAL_HARD |
PCRE_PARTIAL_HARD |
| 1533 |
PCRE_PARTIAL_SOFT |
PCRE_PARTIAL_SOFT |
| 1536 |
compatibility, PCRE_PARTIAL is a synonym for PCRE_PARTIAL_SOFT. A partial match |
compatibility, PCRE_PARTIAL is a synonym for PCRE_PARTIAL_SOFT. A partial match |
| 1537 |
occurs if the end of the subject string is reached successfully, but there are |
occurs if the end of the subject string is reached successfully, but there are |
| 1538 |
not enough subject characters to complete the match. If this happens when |
not enough subject characters to complete the match. If this happens when |
| 1539 |
PCRE_PARTIAL_HARD is set, \fBpcre_exec()\fP immediately returns |
PCRE_PARTIAL_SOFT (but not PCRE_PARTIAL_HARD) is set, matching continues by |
| 1540 |
PCRE_ERROR_PARTIAL. Otherwise, if PCRE_PARTIAL_SOFT is set, matching continues |
testing any remaining alternatives. Only if no complete match can be found is |
| 1541 |
by testing any other alternatives. Only if they all fail is PCRE_ERROR_PARTIAL |
PCRE_ERROR_PARTIAL returned instead of PCRE_ERROR_NOMATCH. In other words, |
| 1542 |
returned (instead of PCRE_ERROR_NOMATCH). The portion of the string that |
PCRE_PARTIAL_SOFT says that the caller is prepared to handle a partial match, |
| 1543 |
was inspected when the partial match was found is set as the first matching |
but only if no complete match can be found. |
| 1544 |
string. There is a more detailed discussion in the |
.P |
| 1545 |
|
If PCRE_PARTIAL_HARD is set, it overrides PCRE_PARTIAL_SOFT. In this case, if a |
| 1546 |
|
partial match is found, \fBpcre_exec()\fP immediately returns |
| 1547 |
|
PCRE_ERROR_PARTIAL, without considering any other alternatives. In other words, |
| 1548 |
|
when PCRE_PARTIAL_HARD is set, a partial match is considered to be more |
| 1549 |
|
important than an alternative complete match. |
| 1550 |
|
.P |
| 1551 |
|
In both cases, the portion of the string that was inspected when the partial |
| 1552 |
|
match was found is set as the first matching string. There is a more detailed |
| 1553 |
|
discussion of partial and multi-segment matching, with examples, in the |
| 1554 |
.\" HREF |
.\" HREF |
| 1555 |
\fBpcrepartial\fP |
\fBpcrepartial\fP |
| 1556 |
.\" |
.\" |
| 1557 |
documentation. |
documentation. |
| 1558 |
. |
. |
| 1559 |
|
. |
| 1560 |
.SS "The string to be matched by \fBpcre_exec()\fP" |
.SS "The string to be matched by \fBpcre_exec()\fP" |
| 1561 |
.rs |
.rs |
| 1562 |
.sp |
.sp |
| 1563 |
The subject string is passed to \fBpcre_exec()\fP as a pointer in |
The subject string is passed to \fBpcre_exec()\fP as a pointer in |
| 1564 |
\fIsubject\fP, a length (in bytes) in \fIlength\fP, and a starting byte offset |
\fIsubject\fP, a length (in bytes) in \fIlength\fP, and a starting byte offset |
| 1565 |
in \fIstartoffset\fP. In UTF-8 mode, the byte offset must point to the start of |
in \fIstartoffset\fP. If this is negative or greater than the length of the |
| 1566 |
a UTF-8 character. Unlike the pattern string, the subject may contain binary |
subject, \fBpcre_exec()\fP returns PCRE_ERROR_BADOFFSET. |
| 1567 |
zero bytes. When the starting offset is zero, the search for a match starts at |
.P |
| 1568 |
the beginning of the subject, and this is by far the most common case. |
In UTF-8 mode, the byte offset must point to the start of a UTF-8 character (or |
| 1569 |
|
the end of the subject). Unlike the pattern string, the subject may contain |
| 1570 |
|
binary zero bytes. When the starting offset is zero, the search for a match |
| 1571 |
|
starts at the beginning of the subject, and this is by far the most common |
| 1572 |
|
case. |
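A caller that sets PCRE_NO_UTF8_CHECK may want to validate the starting offset itself. The following is a minimal sketch of such a check, assuming the subject is already known to be valid UTF-8. The enum and the function name check_start_offset are hypothetical and are not part of the PCRE API.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result codes; these are not PCRE error numbers. */
enum offset_check { OFFSET_OK, OFFSET_OUT_OF_RANGE, OFFSET_MID_CHARACTER };

/* Apply the two rules described above to a proposed starting offset:
 * it must lie in 0..length, and in UTF-8 mode it must point to the start
 * of a character or to the end of the subject. */
static enum offset_check check_start_offset(const char *subject, int length,
                                            int startoffset, int utf8)
{
    assert(subject != NULL && length >= 0);

    if (startoffset < 0 || startoffset > length)
        return OFFSET_OUT_OF_RANGE;

    /* UTF-8 continuation bytes have the bit pattern 10xxxxxx. An offset
     * that lands on one is in the middle of a character. */
    if (utf8 && startoffset < length &&
        ((unsigned char)subject[startoffset] & 0xC0) == 0x80)
        return OFFSET_MID_CHARACTER;

    return OFFSET_OK;
}
```

The check reads only the byte at the offset, so it says nothing about whether the rest of the subject is valid UTF-8.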
| 1573 |
.P |
.P |
| 1574 |
A non-zero starting offset is useful when searching for another match in the |
A non-zero starting offset is useful when searching for another match in the |
| 1575 |
same subject by calling \fBpcre_exec()\fP again after a previous success. |
same subject by calling \fBpcre_exec()\fP again after a previous success. |
| 1589 |
set to 4, it finds the second occurrence of "iss" because it is able to look |
set to 4, it finds the second occurrence of "iss" because it is able to look |
| 1590 |
behind the starting point to discover that it is preceded by a letter. |
behind the starting point to discover that it is preceded by a letter. |
| 1591 |
.P |
.P |
| 1592 |
|
Finding all the matches in a subject is tricky when the pattern can match an |
| 1593 |
|
empty string. It is possible to emulate Perl's /g behaviour by first trying the |
| 1594 |
|
match again at the same offset, with the PCRE_NOTEMPTY_ATSTART and |
| 1595 |
|
PCRE_ANCHORED options, and then if that fails, advancing the starting offset |
| 1596 |
|
and trying an ordinary match again. There is some code that demonstrates how to |
| 1597 |
|
do this in the |
| 1598 |
|
.\" HREF |
| 1599 |
|
\fBpcredemo\fP |
| 1600 |
|
.\" |
| 1601 |
|
sample program. In the most general case, you have to check to see if the |
| 1602 |
|
newline convention recognizes CRLF as a newline, and if so, and the current |
| 1603 |
|
character is CR followed by LF, advance the starting offset by two characters |
| 1604 |
|
instead of one. |
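The CRLF rule in the last sentence can be sketched as a small helper. This is an illustration only, not the code from the sample program. The function name advance_after_empty_match and the crlf_is_newline flag are hypothetical, and the caller is assumed to have already worked out whether the newline convention in force recognizes CRLF.

```c
#include <assert.h>
#include <stddef.h>

/* Return the offset at which to retry an ordinary match after the
 * anchored, non-empty retry at offset has failed.
 * crlf_is_newline is nonzero when the newline convention in force
 * recognizes CRLF as a newline. */
static int advance_after_empty_match(const char *subject, int length,
                                     int offset, int crlf_is_newline)
{
    assert(subject != NULL && offset >= 0 && offset < length);

    /* Step over CR LF as a unit so the next attempt does not start
     * between the two bytes of a single newline. */
    if (crlf_is_newline && offset + 1 < length &&
        subject[offset] == '\r' && subject[offset + 1] == '\n')
        return offset + 2;

    return offset + 1;
}
```

The sketch advances by bytes. In UTF-8 mode the step would also have to cover a whole character, which is not handled here.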
| 1605 |
|
.P |
| 1606 |
If a non-zero starting offset is passed when the pattern is anchored, one |
If a non-zero starting offset is passed when the pattern is anchored, one |
| 1607 |
attempt to match at the given offset is made. This can only succeed if the |
attempt to match at the given offset is made. This can only succeed if the |
| 1608 |
pattern does not require the match to be at the start of the subject. |
pattern does not require the match to be at the start of the subject. |
| 1609 |
. |
. |
| 1610 |
|
. |
| 1611 |
.SS "How \fBpcre_exec()\fP returns captured substrings" |
.SS "How \fBpcre_exec()\fP returns captured substrings" |
| 1612 |
.rs |
.rs |
| 1613 |
.sp |
.sp |
| 1674 |
expression are also set to -1. For example, if the string "abc" is matched |
expression are also set to -1. For example, if the string "abc" is matched |
| 1675 |
against the pattern (abc)(x(yz)?)? subpatterns 2 and 3 are not matched. The |
against the pattern (abc)(x(yz)?)? subpatterns 2 and 3 are not matched. The |
| 1676 |
return from the function is 2, because the highest used capturing subpattern |
return from the function is 2, because the highest used capturing subpattern |
| 1677 |
number is 1. However, you can refer to the offsets for the second and third |
number is 1, and the offsets for the second and third capturing subpatterns |
| 1678 |
capturing subpatterns if you wish (assuming the vector is large enough, of |
(assuming the vector is large enough, of course) are set to -1. |
| 1679 |
course). |
.P |
| 1680 |
|
\fBNote\fP: Elements of \fIovector\fP that do not correspond to capturing |
| 1681 |
|
parentheses in the pattern are never changed. That is, if a pattern contains |
| 1682 |
|
\fIn\fP capturing parentheses, no more than \fIovector[0]\fP to |
| 1683 |
|
\fIovector[2n+1]\fP are set by \fBpcre_exec()\fP. The other elements retain |
| 1684 |
|
whatever values they previously had. |
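The ovector layout described above can be read back with a small helper. This is a sketch only: the function name capture_length is hypothetical, and the vector is assumed to have been filled in by a successful match and to be large enough for every capturing subpattern in the pattern.

```c
#include <assert.h>
#include <stddef.h>

/* Length in bytes of capture group `group` (0 is the whole match), or -1
 * if that group did not take part in the match.
 * capture_count is the number of capturing parentheses in the pattern;
 * elements beyond ovector[2*capture_count + 1] are never examined. */
static int capture_length(const int *ovector, int capture_count, int group)
{
    assert(ovector != NULL && group >= 0 && group <= capture_count);

    /* Each group owns one pair of offsets: start, then end. An unset
     * group has both offsets equal to -1. */
    if (ovector[2 * group] < 0)
        return -1;
    return ovector[2 * group + 1] - ovector[2 * group];
}
```

For the "abc" example against (abc)(x(yz)?)?, the pairs for groups 0 and 1 are both (0, 3), so the helper returns 3 for each, and it returns -1 for groups 2 and 3.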
| 1685 |
.P |
.P |
| 1686 |
Some convenience functions are provided for extracting the captured substrings |
Some convenience functions are provided for extracting the captured substrings |
| 1687 |
as separate strings. These are described below. |
as separate strings. These are described below. |
| 1727 |
gets a block of memory at the start of matching to use for this purpose. If the |
gets a block of memory at the start of matching to use for this purpose. If the |
| 1728 |
call via \fBpcre_malloc()\fP fails, this error is given. The memory is |
call via \fBpcre_malloc()\fP fails, this error is given. The memory is |
| 1729 |
automatically freed at the end of matching. |
automatically freed at the end of matching. |
| 1730 |
|
.P |
| 1731 |
|
This error is also given if \fBpcre_stack_malloc()\fP fails in |
| 1732 |
|
\fBpcre_exec()\fP. This can happen only when PCRE has been compiled with |
| 1733 |
|
\fB--disable-stack-for-recursion\fP. |
| 1734 |
.sp |
.sp |
| 1735 |
PCRE_ERROR_NOSUBSTRING (-7) |
PCRE_ERROR_NOSUBSTRING (-7) |
| 1736 |
.sp |
.sp |
| 1795 |
PCRE_ERROR_BADNEWLINE (-23) |
PCRE_ERROR_BADNEWLINE (-23) |
| 1796 |
.sp |
.sp |
| 1797 |
An invalid combination of PCRE_NEWLINE_\fIxxx\fP options was given. |
An invalid combination of PCRE_NEWLINE_\fIxxx\fP options was given. |
| 1798 |
|
.sp |
| 1799 |
|
PCRE_ERROR_BADOFFSET (-24) |
| 1800 |
|
.sp |
| 1801 |
|
The value of \fIstartoffset\fP was negative or greater than the length of the |
| 1802 |
|
subject, that is, the value in \fIlength\fP. |
| 1803 |
.P |
.P |
| 1804 |
Error numbers -16 to -20 and -22 are not used by \fBpcre_exec()\fP. |
Error numbers -16 to -20 and -22 are not used by \fBpcre_exec()\fP. |
| 1805 |
. |
. |
| 2084 |
The unused bits of the \fIoptions\fP argument for \fBpcre_dfa_exec()\fP must be |
The unused bits of the \fIoptions\fP argument for \fBpcre_dfa_exec()\fP must be |
| 2085 |
zero. The only bits that may be set are PCRE_ANCHORED, PCRE_NEWLINE_\fIxxx\fP, |
zero. The only bits that may be set are PCRE_ANCHORED, PCRE_NEWLINE_\fIxxx\fP, |
| 2086 |
PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, PCRE_NOTEMPTY_ATSTART, |
PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, PCRE_NOTEMPTY_ATSTART, |
| 2087 |
PCRE_NO_UTF8_CHECK, PCRE_PARTIAL_HARD, PCRE_PARTIAL_SOFT, PCRE_DFA_SHORTEST, |
PCRE_NO_UTF8_CHECK, PCRE_BSR_ANYCRLF, PCRE_BSR_UNICODE, PCRE_NO_START_OPTIMIZE, |
| 2088 |
and PCRE_DFA_RESTART. All but the last four of these are exactly the same as |
PCRE_PARTIAL_HARD, PCRE_PARTIAL_SOFT, PCRE_DFA_SHORTEST, and PCRE_DFA_RESTART. |
| 2089 |
for \fBpcre_exec()\fP, so their description is not repeated here. |
All but the last four of these are exactly the same as for \fBpcre_exec()\fP, |
| 2090 |
|
so their description is not repeated here. |
| 2091 |
.sp |
.sp |
| 2092 |
PCRE_PARTIAL_HARD |
PCRE_PARTIAL_HARD |
| 2093 |
PCRE_PARTIAL_SOFT |
PCRE_PARTIAL_SOFT |
| 2102 |
there have been no complete matches, but there is still at least one matching |
there have been no complete matches, but there is still at least one matching |
| 2103 |
possibility. The portion of the string that was inspected when the longest |
possibility. The portion of the string that was inspected when the longest |
| 2104 |
partial match was found is set as the first matching string in both cases. |
partial match was found is set as the first matching string in both cases. |
| 2105 |
|
There is a more detailed discussion of partial and multi-segment matching, with |
| 2106 |
|
examples, in the |
| 2107 |
|
.\" HREF |
| 2108 |
|
\fBpcrepartial\fP |
| 2109 |
|
.\" |
| 2110 |
|
documentation. |
| 2111 |
.sp |
.sp |
| 2112 |
PCRE_DFA_SHORTEST |
PCRE_DFA_SHORTEST |
| 2113 |
.sp |
.sp |
| 2227 |
.rs |
.rs |
| 2228 |
.sp |
.sp |
| 2229 |
.nf |
.nf |
| 2230 |
Last updated: 26 May 2010 |
Last updated: 06 November 2010 |
| 2231 |
Copyright (c) 1997-2010 University of Cambridge. |
Copyright (c) 1997-2010 University of Cambridge. |
| 2232 |
.fi |
.fi |