| 1443 |
ordinary match again. There is some code that demonstrates how to do this in |
ordinary match again. There is some code that demonstrates how to do this in |
| 1444 |
the |
the |
| 1445 |
<a href="pcredemo.html"><b>pcredemo</b></a> |
<a href="pcredemo.html"><b>pcredemo</b></a> |
| 1446 |
sample program. |
sample program. In the most general case, you have to check to see if the |
| 1447 |
|
newline convention recognizes CRLF as a newline, and if so, and the current |
| 1448 |
|
character is CR followed by LF, advance the starting offset by two characters |
| 1449 |
|
instead of one. |
| 1450 |
<pre> |
<pre> |
| 1451 |
PCRE_NO_START_OPTIMIZE |
PCRE_NO_START_OPTIMIZE |
| 1452 |
</pre> |
</pre> |
| 1504 |
in the main |
in the main |
| 1505 |
<a href="pcre.html"><b>pcre</b></a> |
<a href="pcre.html"><b>pcre</b></a> |
| 1506 |
page. If an invalid UTF-8 sequence of bytes is found, <b>pcre_exec()</b> returns |
page. If an invalid UTF-8 sequence of bytes is found, <b>pcre_exec()</b> returns |
| 1507 |
the error PCRE_ERROR_BADUTF8. If <i>startoffset</i> contains an invalid value, |
the error PCRE_ERROR_BADUTF8. If <i>startoffset</i> contains a value that does |
| 1508 |
|
not point to the start of a UTF-8 character (or to the end of the subject), |
| 1509 |
PCRE_ERROR_BADUTF8_OFFSET is returned. |
PCRE_ERROR_BADUTF8_OFFSET is returned. |
| 1510 |
</P> |
</P> |
| 1511 |
<P> |
<P> |
| 1514 |
calling <b>pcre_exec()</b>. You might want to do this for the second and |
calling <b>pcre_exec()</b>. You might want to do this for the second and |
| 1515 |
subsequent calls to <b>pcre_exec()</b> if you are making repeated calls to find |
subsequent calls to <b>pcre_exec()</b> if you are making repeated calls to find |
| 1516 |
all the matches in a single subject string. However, you should be sure that |
all the matches in a single subject string. However, you should be sure that |
| 1517 |
the value of <i>startoffset</i> points to the start of a UTF-8 character. When |
the value of <i>startoffset</i> points to the start of a UTF-8 character (or the |
| 1518 |
PCRE_NO_UTF8_CHECK is set, the effect of passing an invalid UTF-8 string as a |
end of the subject). When PCRE_NO_UTF8_CHECK is set, the effect of passing an |
| 1519 |
subject, or a value of <i>startoffset</i> that does not point to the start of a |
invalid UTF-8 string as a subject or an invalid value of <i>startoffset</i> is |
| 1520 |
UTF-8 character, is undefined. Your program may crash. |
undefined. Your program may crash. |
| 1521 |
<pre> |
<pre> |
| 1522 |
PCRE_PARTIAL_HARD |
PCRE_PARTIAL_HARD |
| 1523 |
PCRE_PARTIAL_SOFT |
PCRE_PARTIAL_SOFT |
| 1526 |
compatibility, PCRE_PARTIAL is a synonym for PCRE_PARTIAL_SOFT. A partial match |
compatibility, PCRE_PARTIAL is a synonym for PCRE_PARTIAL_SOFT. A partial match |
| 1527 |
occurs if the end of the subject string is reached successfully, but there are |
occurs if the end of the subject string is reached successfully, but there are |
| 1528 |
not enough subject characters to complete the match. If this happens when |
not enough subject characters to complete the match. If this happens when |
| 1529 |
PCRE_PARTIAL_HARD is set, <b>pcre_exec()</b> immediately returns |
PCRE_PARTIAL_SOFT (but not PCRE_PARTIAL_HARD) is set, matching continues by |
| 1530 |
PCRE_ERROR_PARTIAL. Otherwise, if PCRE_PARTIAL_SOFT is set, matching continues |
testing any remaining alternatives. Only if no complete match can be found is |
| 1531 |
by testing any other alternatives. Only if they all fail is PCRE_ERROR_PARTIAL |
PCRE_ERROR_PARTIAL returned instead of PCRE_ERROR_NOMATCH. In other words, |
| 1532 |
returned (instead of PCRE_ERROR_NOMATCH). The portion of the string that |
PCRE_PARTIAL_SOFT says that the caller is prepared to handle a partial match, |
| 1533 |
was inspected when the partial match was found is set as the first matching |
but only if no complete match can be found. |
| 1534 |
string. There is a more detailed discussion in the |
</P> |
| 1535 |
|
<P> |
| 1536 |
|
If PCRE_PARTIAL_HARD is set, it overrides PCRE_PARTIAL_SOFT. In this case, if a |
| 1537 |
|
partial match is found, <b>pcre_exec()</b> immediately returns |
| 1538 |
|
PCRE_ERROR_PARTIAL, without considering any other alternatives. In other words, |
| 1539 |
|
when PCRE_PARTIAL_HARD is set, a partial match is considered to be more |
| 1540 |
|
important than an alternative complete match. |
| 1541 |
|
</P> |
| 1542 |
|
<P> |
| 1543 |
|
In both cases, the portion of the string that was inspected when the partial |
| 1544 |
|
match was found is set as the first matching string. There is a more detailed |
| 1545 |
|
discussion of partial and multi-segment matching, with examples, in the |
| 1546 |
<a href="pcrepartial.html"><b>pcrepartial</b></a> |
<a href="pcrepartial.html"><b>pcrepartial</b></a> |
| 1547 |
documentation. |
documentation. |
| 1548 |
</P> |
</P> |
| 1552 |
<P> |
<P> |
| 1553 |
The subject string is passed to <b>pcre_exec()</b> as a pointer in |
The subject string is passed to <b>pcre_exec()</b> as a pointer in |
| 1554 |
<i>subject</i>, a length (in bytes) in <i>length</i>, and a starting byte offset |
<i>subject</i>, a length (in bytes) in <i>length</i>, and a starting byte offset |
| 1555 |
in <i>startoffset</i>. In UTF-8 mode, the byte offset must point to the start of |
in <i>startoffset</i>. If this is negative or greater than the length of the |
| 1556 |
a UTF-8 character. Unlike the pattern string, the subject may contain binary |
subject, <b>pcre_exec()</b> returns PCRE_ERROR_BADOFFSET. |
| 1557 |
zero bytes. When the starting offset is zero, the search for a match starts at |
</P> |
| 1558 |
the beginning of the subject, and this is by far the most common case. |
<P> |
| 1559 |
|
In UTF-8 mode, the byte offset must point to the start of a UTF-8 character (or |
| 1560 |
|
the end of the subject). Unlike the pattern string, the subject may contain |
| 1561 |
|
binary zero bytes. When the starting offset is zero, the search for a match |
| 1562 |
|
starts at the beginning of the subject, and this is by far the most common |
| 1563 |
|
case. |
| 1564 |
</P> |
</P> |
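<P>
The two offset conditions described above can be illustrated with a small
stand-alone check. This is only a sketch of the rules as stated in the text, not
PCRE's own implementation: the function <b>classify_startoffset()</b> and its
enum values are hypothetical names invented for this example, and the UTF-8
test assumes that continuation bytes are exactly those of the form 10xxxxxx.
</P>

```c
/* Hypothetical result codes for this sketch; they are not PCRE constants. */
enum offset_status {
    OFFSET_OK,            /* offset is acceptable */
    OFFSET_OUT_OF_RANGE,  /* the case the text maps to PCRE_ERROR_BADOFFSET */
    OFFSET_MID_CHARACTER  /* the case the text maps to PCRE_ERROR_BADUTF8_OFFSET */
};

/* Classify a starting byte offset according to the rules in the text:
 * it must lie in [0, length], and in UTF-8 mode it must point to the
 * start of a character or to the end of the subject. */
enum offset_status classify_startoffset(const char *subject, int length,
                                        int startoffset, int utf8)
{
    if (startoffset < 0 || startoffset > length)
        return OFFSET_OUT_OF_RANGE;

    /* A byte of the form 10xxxxxx is a UTF-8 continuation byte, so an
     * offset that lands on one is in the middle of a character. */
    if (utf8 && startoffset < length &&
        ((unsigned char)subject[startoffset] & 0xC0) == 0x80)
        return OFFSET_MID_CHARACTER;

    return OFFSET_OK;
}
```

<P>
A caller that sets PCRE_NO_UTF8_CHECK could run a check of this kind itself
before each call, since the library no longer validates the offset in that case.
</P>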
| 1565 |
<P> |
<P> |
| 1566 |
A non-zero starting offset is useful when searching for another match in the |
A non-zero starting offset is useful when searching for another match in the |
| 1582 |
behind the starting point to discover that it is preceded by a letter. |
behind the starting point to discover that it is preceded by a letter. |
| 1583 |
</P> |
</P> |
| 1584 |
<P> |
<P> |
| 1585 |
|
Finding all the matches in a subject is tricky when the pattern can match an |
| 1586 |
|
empty string. It is possible to emulate Perl's /g behaviour by first trying the |
| 1587 |
|
match again at the same offset, with the PCRE_NOTEMPTY_ATSTART and |
| 1588 |
|
PCRE_ANCHORED options, and then if that fails, advancing the starting offset |
| 1589 |
|
and trying an ordinary match again. There is some code that demonstrates how to |
| 1590 |
|
do this in the |
| 1591 |
|
<a href="pcredemo.html"><b>pcredemo</b></a> |
| 1592 |
|
sample program. In the most general case, you have to check to see if the |
| 1593 |
|
newline convention recognizes CRLF as a newline, and if so, and the current |
| 1594 |
|
character is CR followed by LF, advance the starting offset by two characters |
| 1595 |
|
instead of one. |
| 1596 |
|
</P> |
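<P>
The "advance the starting offset" step can be sketched as a stand-alone helper.
This is an illustration under stated assumptions rather than code taken from
the library: <b>advance_after_empty_match()</b> is a hypothetical name, the
caller is assumed to have already determined whether the newline convention
recognizes CRLF (passed here as <i>crlf_is_newline</i>), and in UTF-8 mode one
"character" is taken to mean a lead byte plus its continuation bytes.
</P>

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of PCRE. Returns the number of bytes by
 * which to advance the starting offset after the anchored, non-empty retry
 * at `offset` has failed. Returns 0 when offset is already at the end of
 * the subject, meaning there is nothing left to search. */
size_t advance_after_empty_match(const char *subject, size_t length,
                                 size_t offset, int crlf_is_newline,
                                 int utf8)
{
    assert(offset <= length);  /* precondition: offset lies within the subject */

    if (offset == length)
        return 0;

    /* If CRLF counts as a newline and we are sitting on CR LF,
     * step over both bytes instead of one. */
    if (crlf_is_newline && offset + 1 < length &&
        subject[offset] == '\r' && subject[offset + 1] == '\n')
        return 2;

    /* Otherwise advance by one character. In UTF-8 mode that means
     * also skipping any continuation bytes (10xxxxxx) that follow. */
    size_t n = 1;
    if (utf8) {
        while (offset + n < length &&
               ((unsigned char)subject[offset + n] & 0xC0) == 0x80)
            n++;
    }
    return n;
}
```

<P>
In a matching loop, the returned value would be added to the starting offset
before the next ordinary (unanchored) call; a return of zero would end the loop.
</P>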
| 1597 |
|
<P> |
| 1598 |
If a non-zero starting offset is passed when the pattern is anchored, one |
If a non-zero starting offset is passed when the pattern is anchored, one |
| 1599 |
attempt to match at the given offset is made. This can only succeed if the |
attempt to match at the given offset is made. This can only succeed if the |
| 1600 |
pattern does not require the match to be at the start of the subject. |
pattern does not require the match to be at the start of the subject. |
| 1789 |
PCRE_ERROR_BADNEWLINE (-23) |
PCRE_ERROR_BADNEWLINE (-23) |
| 1790 |
</pre> |
</pre> |
| 1791 |
An invalid combination of PCRE_NEWLINE_<i>xxx</i> options was given. |
An invalid combination of PCRE_NEWLINE_<i>xxx</i> options was given. |
| 1792 |
|
<pre> |
| 1793 |
|
PCRE_ERROR_BADOFFSET (-24) |
| 1794 |
|
</pre> |
| 1795 |
|
The value of <i>startoffset</i> was negative or greater than the length of the |
| 1796 |
|
subject, that is, the value in <i>length</i>. |
| 1797 |
</P> |
</P> |
| 1798 |
<P> |
<P> |
| 1799 |
Error numbers -16 to -20 and -22 are not used by <b>pcre_exec()</b>. |
Error numbers -16 to -20 and -22 are not used by <b>pcre_exec()</b>. |
| 2088 |
there have been no complete matches, but there is still at least one matching |
there have been no complete matches, but there is still at least one matching |
| 2089 |
possibility. The portion of the string that was inspected when the longest |
possibility. The portion of the string that was inspected when the longest |
| 2090 |
partial match was found is set as the first matching string in both cases. |
partial match was found is set as the first matching string in both cases. |
| 2091 |
|
There is a more detailed discussion of partial and multi-segment matching, with |
| 2092 |
|
examples, in the |
| 2093 |
|
<a href="pcrepartial.html"><b>pcrepartial</b></a> |
| 2094 |
|
documentation. |
| 2095 |
<pre> |
<pre> |
| 2096 |
PCRE_DFA_SHORTEST |
PCRE_DFA_SHORTEST |
| 2097 |
</pre> |
</pre> |
| 2203 |
</P> |
</P> |
| 2204 |
<br><a name="SEC22" href="#TOC1">REVISION</a><br> |
<br><a name="SEC22" href="#TOC1">REVISION</a><br> |
| 2205 |
<P> |
<P> |
| 2206 |
Last updated: 21 June 2010 |
Last updated: 06 November 2010 |
| 2207 |
<br> |
<br> |
| 2208 |
Copyright © 1997-2010 University of Cambridge. |
Copyright © 1997-2010 University of Cambridge. |
| 2209 |
<br> |
<br> |