| 45 |
as soon as a mistake is made, by beeping and not reflecting the character that |
as soon as a mistake is made, by beeping and not reflecting the character that |
| 46 |
has been typed, for example. This immediate feedback is likely to be a better |
has been typed, for example. This immediate feedback is likely to be a better |
| 47 |
user interface than a check that is delayed until the entire string has been |
user interface than a check that is delayed until the entire string has been |
| 48 |
entered. Partial matching can also sometimes be useful when the subject string |
entered. Partial matching can also be useful when the subject string is very |
| 49 |
is very long and is not all available at once. |
long and is not all available at once. |
| 50 |
</P> |
</P> |
| 51 |
<P> |
<P> |
| 52 |
PCRE supports partial matching by means of the PCRE_PARTIAL_SOFT and |
PCRE supports partial matching by means of the PCRE_PARTIAL_SOFT and |
| 68 |
</P> |
</P> |
| 69 |
<br><a name="SEC2" href="#TOC1">PARTIAL MATCHING USING pcre_exec()</a><br> |
<br><a name="SEC2" href="#TOC1">PARTIAL MATCHING USING pcre_exec()</a><br> |
| 70 |
<P> |
<P> |
| 71 |
A partial match occurs during a call to <b>pcre_exec()</b> whenever the end of |
A partial match occurs during a call to <b>pcre_exec()</b> when the end of the |
| 72 |
the subject string is reached successfully, but matching cannot continue |
subject string is reached successfully, but matching cannot continue because |
| 73 |
because more characters are needed. However, at least one character must have |
more characters are needed. However, at least one character in the subject must |
| 74 |
been matched. (In other words, a partial match can never be an empty string.) |
have been inspected. This character need not form part of the final matched |
| 75 |
</P> |
string; lookbehind assertions and the \K escape sequence provide ways of |
| 76 |
<P> |
inspecting characters before the start of a matched substring. The requirement |
| 77 |
If PCRE_PARTIAL_SOFT is set, the partial match is remembered, but matching |
for inspecting at least one character exists because an empty string can always |
| 78 |
continues as normal, and other alternatives in the pattern are tried. If no |
be matched; without such a restriction there would always be a partial match of |
| 79 |
complete match can be found, <b>pcre_exec()</b> returns PCRE_ERROR_PARTIAL |
an empty string at the end of the subject. |
| 80 |
instead of PCRE_ERROR_NOMATCH. If there are at least two slots in the offsets |
</P> |
| 81 |
vector, the first of them is set to the offset of the earliest character that |
<P> |
| 82 |
was inspected when the partial match was found. For convenience, the second |
If there are at least two slots in the offsets vector when <b>pcre_exec()</b> |
| 83 |
offset points to the end of the string so that a substring can easily be |
returns with a partial match, the first slot is set to the offset of the |
| 84 |
identified. |
earliest character that was inspected when the partial match was found. For |
| 85 |
|
convenience, the second offset points to the end of the subject so that a |
| 86 |
|
substring can easily be identified. |
| 87 |
</P> |
</P> |
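<P>
As a minimal sketch of how the two offsets can be used, the helper below copies
the partially matched substring out of the subject. The name
<b>copy_partial_substring()</b> and its signature are hypothetical and not part
of the PCRE API. The caller is assumed to have already received
PCRE_ERROR_PARTIAL from <b>pcre_exec()</b> with at least two slots in the
offsets vector.
</P>

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not a PCRE function): copy the substring identified
 * by the first two offsets of a partial match into out, NUL-terminated.
 * ovector[0] is the earliest inspected character, ovector[1] the end of
 * the subject. Returns the substring length, or -1 if the offsets are
 * inconsistent or out is too small.
 */
int copy_partial_substring(const char *subject, const int *ovector,
                           char *out, size_t out_size)
{
    assert(subject != NULL && ovector != NULL && out != NULL);

    int start = ovector[0];
    int end = ovector[1];
    if (start < 0 || end < start)
        return -1;

    size_t len = (size_t)(end - start);
    if (len + 1 > out_size)          /* need room for the terminating NUL */
        return -1;

    memcpy(out, subject + start, len);
    out[len] = '\0';
    return (int)len;
}
```

<P>
With the subject "abc123dog" and offsets 3 and 9, this sketch yields the
substring "123dog".
</P>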
| 88 |
<P> |
<P> |
| 89 |
For the majority of patterns, the first offset identifies the start of the |
For the majority of patterns, the first offset identifies the start of the |
| 96 |
This pattern matches "123", but only if it is preceded by "abc". If the subject |
This pattern matches "123", but only if it is preceded by "abc". If the subject |
| 97 |
string is "xyzabc12", the offsets after a partial match are for the substring |
string is "xyzabc12", the offsets after a partial match are for the substring |
| 98 |
"abc12", because all these characters are needed if another match is tried |
"abc12", because all these characters are needed if another match is tried |
| 99 |
with extra characters added. |
with extra characters added to the subject. |
| 100 |
|
</P> |
| 101 |
|
<P> |
| 102 |
|
What happens when a partial match is identified depends on which of the two |
| 103 |
|
partial matching options is set. |
| 104 |
|
</P> |
| 105 |
|
<br><b> |
| 106 |
|
PCRE_PARTIAL_SOFT with pcre_exec() |
| 107 |
|
</b><br> |
| 108 |
|
<P> |
| 109 |
|
If PCRE_PARTIAL_SOFT is set when <b>pcre_exec()</b> identifies a partial match, |
| 110 |
|
the partial match is remembered, but matching continues as normal, and other |
| 111 |
|
alternatives in the pattern are tried. If no complete match can be found, |
| 112 |
|
<b>pcre_exec()</b> returns PCRE_ERROR_PARTIAL instead of PCRE_ERROR_NOMATCH. |
| 113 |
|
</P> |
| 114 |
|
<P> |
| 115 |
|
This option is "soft" because it prefers a complete match over a partial match. |
| 116 |
|
All the various matching items in a pattern behave as if the subject string is |
| 117 |
|
potentially complete. For example, \z, \Z, and $ match at the end of the |
| 118 |
|
subject, as normal, and for \b and \B the end of the subject is treated as a |
| 119 |
|
non-alphanumeric. |
| 120 |
</P> |
</P> |
| 121 |
<P> |
<P> |
| 122 |
If there is more than one partial match, the first one that was found provides |
If there is more than one partial match, the first one that was found provides |
| 126 |
</pre> |
</pre> |
| 127 |
If this is matched against the subject string "abc123dog", both |
If this is matched against the subject string "abc123dog", both |
| 128 |
alternatives fail to match, but the end of the subject is reached during |
alternatives fail to match, but the end of the subject is reached during |
| 129 |
matching, so PCRE_ERROR_PARTIAL is returned instead of PCRE_ERROR_NOMATCH. The |
matching, so PCRE_ERROR_PARTIAL is returned. The offsets are set to 3 and 9, |
| 130 |
offsets are set to 3 and 9, identifying "123dog" as the first partial match |
identifying "123dog" as the first partial match that was found. (In this |
| 131 |
that was found. (In this example, there are two partial matches, because "dog" |
example, there are two partial matches, because "dog" on its own partially |
| 132 |
on its own partially matches the second alternative.) |
matches the second alternative.) |
| 133 |
</P> |
</P> |
| 134 |
|
<br><b> |
| 135 |
|
PCRE_PARTIAL_HARD with pcre_exec() |
| 136 |
|
</b><br> |
| 137 |
<P> |
<P> |
| 138 |
If PCRE_PARTIAL_HARD is set for <b>pcre_exec()</b>, it returns |
If PCRE_PARTIAL_HARD is set for <b>pcre_exec()</b>, it returns |
| 139 |
PCRE_ERROR_PARTIAL as soon as a partial match is found, without continuing to |
PCRE_ERROR_PARTIAL as soon as a partial match is found, without continuing to |
| 140 |
search for possible complete matches. The difference between the two options |
search for possible complete matches. This option is "hard" because it prefers |
| 141 |
can be illustrated by a pattern such as: |
an earlier partial match over a later complete match. For this reason, the |
| 142 |
|
assumption is made that the end of the supplied subject string may not be the |
| 143 |
|
true end of the available data, and so, if \z, \Z, \b, \B, or $ are |
| 144 |
|
encountered at the end of the subject, the result is PCRE_ERROR_PARTIAL. |
| 145 |
|
</P> |
| 146 |
|
<br><b> |
| 147 |
|
Comparing hard and soft partial matching |
| 148 |
|
</b><br> |
| 149 |
|
<P> |
| 150 |
|
The difference between the two partial matching options can be illustrated by a |
| 151 |
|
pattern such as: |
| 152 |
<pre> |
<pre> |
| 153 |
/dog(sbody)?/ |
/dog(sbody)?/ |
| 154 |
</pre> |
</pre> |
| 176 |
character, without backtracking, searching for all possible matches |
character, without backtracking, searching for all possible matches |
| 177 |
simultaneously. If the end of the subject is reached before the end of the |
simultaneously. If the end of the subject is reached before the end of the |
| 178 |
pattern, there is the possibility of a partial match, again provided that at |
pattern, there is the possibility of a partial match, again provided that at |
| 179 |
least one character has matched. |
least one character has been inspected. |
| 180 |
</P> |
</P> |
| 181 |
<P> |
<P> |
| 182 |
When PCRE_PARTIAL_SOFT is set, PCRE_ERROR_PARTIAL is returned only if there |
When PCRE_PARTIAL_SOFT is set, PCRE_ERROR_PARTIAL is returned only if there |
| 297 |
matching. Unlike <b>pcre_dfa_exec()</b>, it is not possible to restart the |
matching. Unlike <b>pcre_dfa_exec()</b>, it is not possible to restart the |
| 298 |
previous match with a new segment of data. Instead, new data must be added to |
previous match with a new segment of data. Instead, new data must be added to |
| 299 |
the previous subject string, and the entire match re-run, starting from the |
the previous subject string, and the entire match re-run, starting from the |
| 300 |
point where the partial match occurred. Earlier data can be discarded. |
point where the partial match occurred. Earlier data can be discarded. It is |
| 301 |
Consider an unanchored pattern that matches dates: |
best to use PCRE_PARTIAL_HARD in this situation, because it does not treat the |
| 302 |
|
end of a segment as the end of the subject when matching \z, \Z, \b, \B, |
| 303 |
|
and $. Consider an unanchored pattern that matches dates: |
| 304 |
<pre> |
<pre> |
| 305 |
re> /\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d/ |
re> /\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d/ |
| 306 |
data> The date is 23ja\P |
data> The date is 23ja\P\P |
| 307 |
Partial match: 23ja |
Partial match: 23ja |
| 308 |
</pre> |
</pre> |
| 309 |
At this stage, an application could discard the text preceding "23ja", add on |
At this stage, an application could discard the text preceding "23ja", add on |
| 324 |
whichever matching function is used. |
whichever matching function is used. |
| 325 |
</P> |
</P> |
| 326 |
<P> |
<P> |
| 327 |
1. If the pattern contains tests for the beginning or end of a line, you need |
1. If the pattern contains a test for the beginning of a line, you need to pass |
| 328 |
to pass the PCRE_NOTBOL or PCRE_NOTEOL options, as appropriate, when the |
the PCRE_NOTBOL option when the subject string for any call does not start at the |
| 329 |
subject string for any call does not contain the beginning or end of a line. |
beginning of a line. There is also a PCRE_NOTEOL option, but in practice when |
| 330 |
|
doing multi-segment matching you should be using PCRE_PARTIAL_HARD, which |
| 331 |
|
includes the effect of PCRE_NOTEOL. |
| 332 |
</P> |
</P> |
| 333 |
<P> |
<P> |
| 334 |
2. Lookbehind assertions at the start of a pattern are catered for in the |
2. Lookbehind assertions at the start of a pattern are catered for in the |
| 344 |
especially when PCRE_PARTIAL_SOFT is used. The section "Partial Matching and |
especially when PCRE_PARTIAL_SOFT is used. The section "Partial Matching and |
| 345 |
Word Boundaries" above describes an issue that arises if the pattern ends with |
Word Boundaries" above describes an issue that arises if the pattern ends with |
| 346 |
\b or \B. Another kind of difference may occur when there are multiple |
\b or \B. Another kind of difference may occur when there are multiple |
| 347 |
matching possibilities, because a partial match result is given only when there |
matching possibilities, because (for PCRE_PARTIAL_SOFT) a partial match result |
| 348 |
are no completed matches. This means that as soon as the shortest match has |
is given only when there are no completed matches. This means that as soon as |
| 349 |
been found, continuation to a new subject segment is no longer possible. |
the shortest match has been found, continuation to a new subject segment is no |
| 350 |
Consider again this <b>pcretest</b> example: |
longer possible. Consider again this <b>pcretest</b> example: |
| 351 |
<pre> |
<pre> |
| 352 |
re> /dog(sbody)?/ |
re> /dog(sbody)?/ |
| 353 |
data> dogsb\P |
data> dogsb\P |
| 370 |
<b>pcre_dfa_exec()</b> finds both matches. |
<b>pcre_dfa_exec()</b> finds both matches. |
| 371 |
</P> |
</P> |
| 372 |
<P> |
<P> |
| 373 |
Because of these problems, it is probably best to use PCRE_PARTIAL_HARD when |
Because of these problems, it is best to use PCRE_PARTIAL_HARD when matching |
| 374 |
matching multi-segment data. The example above then behaves differently: |
multi-segment data. The example above then behaves differently: |
| 375 |
<pre> |
<pre> |
| 376 |
re> /dog(sbody)?/ |
re> /dog(sbody)?/ |
| 377 |
data> dogsb\P\P |
data> dogsb\P\P |
| 407 |
each time: |
each time: |
| 408 |
<pre> |
<pre> |
| 409 |
re> /1234|3789/ |
re> /1234|3789/ |
| 410 |
data> ABC123\P |
data> ABC123\P\P |
| 411 |
Partial match: 123 |
Partial match: 123 |
| 412 |
data> 1237890 |
data> 1237890 |
| 413 |
0: 3789 |
0: 3789 |
| 414 |
</pre> |
</pre> |
| 415 |
Of course, instead of using PCRE_DFA_PARTIAL, the same technique of re-running |
Of course, instead of using PCRE_DFA_RESTART, the same technique of re-running |
| 416 |
the entire match can also be used with <b>pcre_dfa_exec()</b>. Another |
the entire match can also be used with <b>pcre_dfa_exec()</b>. Another |
| 417 |
possibility is to work with two buffers. If a partial match at offset <i>n</i> |
possibility is to work with two buffers. If a partial match at offset <i>n</i> |
| 418 |
in the first buffer is followed by "no match" when PCRE_DFA_RESTART is used on |
in the first buffer is followed by "no match" when PCRE_DFA_RESTART is used on |
| 430 |
</P> |
</P> |
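<P>
The technique of re-running a <b>pcre_exec()</b> match after a partial match
(discard earlier data, keep the text from the start of the partial match, and
append the next segment) might be sketched as follows. The function
<b>carry_over()</b> and its parameters are hypothetical. The sketch covers only
the buffer handling, and assumes the caller passes the first offset from the
partial match as <i>partial_start</i>.
</P>

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not a PCRE function): prepare the subject for the
 * next matching call after a partial match.
 *
 * buf holds `used` bytes of the previous subject and has room for `cap`
 * bytes. partial_start is the first offset reported with the partial
 * match. Everything before partial_start is discarded, the retained tail
 * is moved to the front of buf, and the new segment is appended.
 *
 * Returns the length of the new subject, or (size_t)-1 if buf is too
 * small to hold the retained tail plus the new segment.
 */
size_t carry_over(char *buf, size_t cap, size_t used, size_t partial_start,
                  const char *segment, size_t seg_len)
{
    assert(buf != NULL && segment != NULL);
    assert(partial_start <= used && used <= cap);

    size_t kept = used - partial_start;
    if (seg_len > cap - kept)
        return (size_t)-1;

    memmove(buf, buf + partial_start, kept);   /* regions may overlap */
    memcpy(buf + kept, segment, seg_len);
    return kept + seg_len;
}
```

<P>
The returned length would then be passed as the subject length in the next
matching call, normally with PCRE_PARTIAL_HARD still set.
</P>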
| 431 |
<br><a name="SEC11" href="#TOC1">REVISION</a><br> |
<br><a name="SEC11" href="#TOC1">REVISION</a><br> |
| 432 |
<P> |
<P> |
| 433 |
Last updated: 19 October 2009 |
Last updated: 22 October 2010 |
| 434 |
<br> |
<br> |
| 435 |
Copyright © 1997-2009 University of Cambridge. |
Copyright © 1997-2010 University of Cambridge. |
| 436 |
<br> |
<br> |
| 437 |
<p> |
<p> |
| 438 |
Return to the <a href="index.html">PCRE index page</a>. |
Return to the <a href="index.html">PCRE index page</a>. |