| 14 |
<br> |
<br> |
| 15 |
<ul> |
<ul> |
| 16 |
<li><a name="TOC1" href="#SEC1">PARTIAL MATCHING IN PCRE</a> |
<li><a name="TOC1" href="#SEC1">PARTIAL MATCHING IN PCRE</a> |
| 17 |
<li><a name="TOC2" href="#SEC2">PARTIAL MATCHING USING pcre_exec()</a> |
<li><a name="TOC2" href="#SEC2">PARTIAL MATCHING USING pcre_exec() OR pcre16_exec()</a> |
| 18 |
<li><a name="TOC3" href="#SEC3">PARTIAL MATCHING USING pcre_dfa_exec()</a> |
<li><a name="TOC3" href="#SEC3">PARTIAL MATCHING USING pcre_dfa_exec() OR pcre16_dfa_exec()</a> |
| 19 |
<li><a name="TOC4" href="#SEC4">PARTIAL MATCHING AND WORD BOUNDARIES</a> |
<li><a name="TOC4" href="#SEC4">PARTIAL MATCHING AND WORD BOUNDARIES</a> |
| 20 |
<li><a name="TOC5" href="#SEC5">FORMERLY RESTRICTED PATTERNS</a> |
<li><a name="TOC5" href="#SEC5">FORMERLY RESTRICTED PATTERNS</a> |
| 21 |
<li><a name="TOC6" href="#SEC6">EXAMPLE OF PARTIAL MATCHING USING PCRETEST</a> |
<li><a name="TOC6" href="#SEC6">EXAMPLE OF PARTIAL MATCHING USING PCRETEST</a> |
| 22 |
<li><a name="TOC7" href="#SEC7">MULTI-SEGMENT MATCHING WITH pcre_dfa_exec()</a> |
<li><a name="TOC7" href="#SEC7">MULTI-SEGMENT MATCHING WITH pcre_dfa_exec() OR pcre16_dfa_exec()</a> |
| 23 |
<li><a name="TOC8" href="#SEC8">MULTI-SEGMENT MATCHING WITH pcre_exec()</a> |
<li><a name="TOC8" href="#SEC8">MULTI-SEGMENT MATCHING WITH pcre_exec() OR pcre16_exec()</a> |
| 24 |
<li><a name="TOC9" href="#SEC9">ISSUES WITH MULTI-SEGMENT MATCHING</a> |
<li><a name="TOC9" href="#SEC9">ISSUES WITH MULTI-SEGMENT MATCHING</a> |
| 25 |
<li><a name="TOC10" href="#SEC10">AUTHOR</a> |
<li><a name="TOC10" href="#SEC10">AUTHOR</a> |
| 26 |
<li><a name="TOC11" href="#SEC11">REVISION</a> |
<li><a name="TOC11" href="#SEC11">REVISION</a> |
| 27 |
</ul> |
</ul> |
| 28 |
<br><a name="SEC1" href="#TOC1">PARTIAL MATCHING IN PCRE</a><br> |
<br><a name="SEC1" href="#TOC1">PARTIAL MATCHING IN PCRE</a><br> |
| 29 |
<P> |
<P> |
| 30 |
In normal use of PCRE, if the subject string that is passed to |
In normal use of PCRE, if the subject string that is passed to a matching |
| 31 |
<b>pcre_exec()</b> or <b>pcre_dfa_exec()</b> matches as far as it goes, but is |
function matches as far as it goes, but is too short to match the entire |
| 32 |
too short to match the entire pattern, PCRE_ERROR_NOMATCH is returned. There |
pattern, PCRE_ERROR_NOMATCH is returned. There are circumstances where it might |
| 33 |
are circumstances where it might be helpful to distinguish this case from other |
be helpful to distinguish this case from other cases in which there is no |
| 34 |
cases in which there is no match. |
match. |
| 35 |
</P> |
</P> |
| 36 |
<P> |
<P> |
| 37 |
Consider, for example, an application where a human is required to type in data |
Consider, for example, an application where a human is required to type in data |
| 45 |
as soon as a mistake is made, by beeping and not reflecting the character that |
as soon as a mistake is made, by beeping and not reflecting the character that |
| 46 |
has been typed, for example. This immediate feedback is likely to be a better |
has been typed, for example. This immediate feedback is likely to be a better |
| 47 |
user interface than a check that is delayed until the entire string has been |
user interface than a check that is delayed until the entire string has been |
| 48 |
entered. Partial matching can also sometimes be useful when the subject string |
entered. Partial matching can also be useful when the subject string is very |
| 49 |
is very long and is not all available at once. |
long and is not all available at once. |
| 50 |
</P> |
</P> |
| 51 |
<P> |
<P> |
| 52 |
PCRE supports partial matching by means of the PCRE_PARTIAL_SOFT and |
PCRE supports partial matching by means of the PCRE_PARTIAL_SOFT and |
| 53 |
PCRE_PARTIAL_HARD options, which can be set when calling <b>pcre_exec()</b> or |
PCRE_PARTIAL_HARD options, which can be set when calling any of the matching |
| 54 |
<b>pcre_dfa_exec()</b>. For backwards compatibility, PCRE_PARTIAL is a synonym |
functions. For backwards compatibility, PCRE_PARTIAL is a synonym for |
| 55 |
for PCRE_PARTIAL_SOFT. The essential difference between the two options is |
PCRE_PARTIAL_SOFT. The essential difference between the two options is whether |
| 56 |
whether or not a partial match is preferred to an alternative complete match, |
or not a partial match is preferred to an alternative complete match, though |
| 57 |
though the details differ between the two matching functions. If both options |
the details differ between the two types of matching function. If both options |
| 58 |
are set, PCRE_PARTIAL_HARD takes precedence. |
are set, PCRE_PARTIAL_HARD takes precedence. |
| 59 |
</P> |
</P> |
| 60 |
<P> |
<P> |
| 61 |
Setting a partial matching option disables two of PCRE's optimizations. PCRE |
If you want to use partial matching with just-in-time optimized code, you must |
| 62 |
remembers the last literal byte in a pattern, and abandons matching immediately |
call <b>pcre_study()</b> or <b>pcre16_study()</b> with one or both of these |
| 63 |
if such a byte is not present in the subject string. This optimization cannot |
options: |
| 64 |
be used for a subject string that might match only partially. If the pattern |
<pre> |
| 65 |
was studied, PCRE knows the minimum length of a matching string, and does not |
PCRE_STUDY_JIT_PARTIAL_SOFT_COMPILE |
| 66 |
bother to run the matching function on shorter strings. This optimization is |
PCRE_STUDY_JIT_PARTIAL_HARD_COMPILE |
| 67 |
also disabled for partial matching. |
</pre> |
| 68 |
</P> |
PCRE_STUDY_JIT_COMPILE should also be set if you are going to run non-partial |
| 69 |
<br><a name="SEC2" href="#TOC1">PARTIAL MATCHING USING pcre_exec()</a><br> |
matches on the same pattern. If the appropriate JIT study mode has not been set |
| 70 |
<P> |
for a match, the interpretive matching code is used. |
| 71 |
A partial match occurs during a call to <b>pcre_exec()</b> whenever the end of |
</P> |
| 72 |
the subject string is reached successfully, but matching cannot continue |
<P> |
| 73 |
because more characters are needed. However, at least one character must have |
Setting a partial matching option disables two of PCRE's standard |
| 74 |
been matched. (In other words, a partial match can never be an empty string.) |
optimizations. PCRE remembers the last literal data unit in a pattern, and |
| 75 |
</P> |
abandons matching immediately if it is not present in the subject string. This |
| 76 |
<P> |
optimization cannot be used for a subject string that might match only |
| 77 |
If PCRE_PARTIAL_SOFT is set, the partial match is remembered, but matching |
partially. If the pattern was studied, PCRE knows the minimum length of a |
| 78 |
continues as normal, and other alternatives in the pattern are tried. If no |
matching string, and does not bother to run the matching function on shorter |
| 79 |
complete match can be found, <b>pcre_exec()</b> returns PCRE_ERROR_PARTIAL |
strings. This optimization is also disabled for partial matching. |
| 80 |
instead of PCRE_ERROR_NOMATCH. If there are at least two slots in the offsets |
</P> |
| 81 |
vector, the first of them is set to the offset of the earliest character that |
<br><a name="SEC2" href="#TOC1">PARTIAL MATCHING USING pcre_exec() OR pcre16_exec()</a><br> |
| 82 |
was inspected when the partial match was found. For convenience, the second |
<P> |
| 83 |
offset points to the end of the string so that a substring can easily be |
A partial match occurs during a call to <b>pcre_exec()</b> or |
| 84 |
identified. |
<b>pcre16_exec()</b> when the end of the subject string is reached successfully, |
| 85 |
|
but matching cannot continue because more characters are needed. However, at |
| 86 |
|
least one character in the subject must have been inspected. This character |
| 87 |
|
need not form part of the final matched string; lookbehind assertions and the |
| 88 |
|
\K escape sequence provide ways of inspecting characters before the start of a |
| 89 |
|
matched substring. The requirement for inspecting at least one character exists |
| 90 |
|
because an empty string can always be matched; without such a restriction there |
| 91 |
|
would always be a partial match of an empty string at the end of the subject. |
| 92 |
|
</P> |
| 93 |
|
<P> |
| 94 |
|
If there are at least two slots in the offsets vector when a partial match is |
| 95 |
|
returned, the first slot is set to the offset of the earliest character that |
| 96 |
|
was inspected. For convenience, the second offset points to the end of the |
| 97 |
|
subject so that a substring can easily be identified. |
| 98 |
</P> |
</P> |
| 99 |
<P> |
<P> |
| 100 |
For the majority of patterns, the first offset identifies the start of the |
For the majority of patterns, the first offset identifies the start of the |
| 107 |
This pattern matches "123", but only if it is preceded by "abc". If the subject |
This pattern matches "123", but only if it is preceded by "abc". If the subject |
| 108 |
string is "xyzabc12", the offsets after a partial match are for the substring |
string is "xyzabc12", the offsets after a partial match are for the substring |
| 109 |
"abc12", because all these characters are needed if another match is tried |
"abc12", because all these characters are needed if another match is tried |
| 110 |
with extra characters added. |
with extra characters added to the subject. |
| 111 |
|
</P> |
| 112 |
|
<P> |
| 113 |
|
What happens when a partial match is identified depends on which of the two |
| 114 |
|
partial matching options is set. |
| 115 |
|
</P> |
| 116 |
|
<br><b> |
| 117 |
|
PCRE_PARTIAL_SOFT WITH pcre_exec() OR pcre16_exec() |
| 118 |
|
</b><br> |
| 119 |
|
<P> |
| 120 |
|
If PCRE_PARTIAL_SOFT is set when <b>pcre_exec()</b> or <b>pcre16_exec()</b> |
| 121 |
|
identifies a partial match, the partial match is remembered, but matching |
| 122 |
|
continues as normal, and other alternatives in the pattern are tried. If no |
| 123 |
|
complete match can be found, PCRE_ERROR_PARTIAL is returned instead of |
| 124 |
|
PCRE_ERROR_NOMATCH. |
| 125 |
|
</P> |
| 126 |
|
<P> |
| 127 |
|
This option is "soft" because it prefers a complete match over a partial match. |
| 128 |
|
All the various matching items in a pattern behave as if the subject string is |
| 129 |
|
potentially complete. For example, \z, \Z, and $ match at the end of the |
| 130 |
|
subject, as normal, and for \b and \B the end of the subject is treated as a |
| 131 |
|
non-alphanumeric. |
| 132 |
</P> |
</P> |
| 133 |
<P> |
<P> |
| 134 |
If there is more than one partial match, the first one that was found provides |
If there is more than one partial match, the first one that was found provides |
| 138 |
</pre> |
</pre> |
| 139 |
If this is matched against the subject string "abc123dog", both |
If this is matched against the subject string "abc123dog", both |
| 140 |
alternatives fail to match, but the end of the subject is reached during |
alternatives fail to match, but the end of the subject is reached during |
| 141 |
matching, so PCRE_ERROR_PARTIAL is returned instead of PCRE_ERROR_NOMATCH. The |
matching, so PCRE_ERROR_PARTIAL is returned. The offsets are set to 3 and 9, |
| 142 |
offsets are set to 3 and 9, identifying "123dog" as the first partial match |
identifying "123dog" as the first partial match that was found. (In this |
| 143 |
that was found. (In this example, there are two partial matches, because "dog" |
example, there are two partial matches, because "dog" on its own partially |
| 144 |
on its own partially matches the second alternative.) |
matches the second alternative.) |
| 145 |
</P> |
</P> |
| 146 |
|
<br><b> |
| 147 |
|
PCRE_PARTIAL_HARD WITH pcre_exec() OR pcre16_exec() |
| 148 |
|
</b><br> |
| 149 |
|
<P> |
| 150 |
|
If PCRE_PARTIAL_HARD is set for <b>pcre_exec()</b> or <b>pcre16_exec()</b>, |
| 151 |
|
PCRE_ERROR_PARTIAL is returned as soon as a partial match is found, without |
| 152 |
|
continuing to search for possible complete matches. This option is "hard" |
| 153 |
|
because it prefers an earlier partial match over a later complete match. For |
| 154 |
|
this reason, the assumption is made that the end of the supplied subject string |
| 155 |
|
may not be the true end of the available data, and so, if \z, \Z, \b, \B, |
| 156 |
|
or $ are encountered at the end of the subject, the result is |
| 157 |
|
PCRE_ERROR_PARTIAL, provided that at least one character in the subject has |
| 158 |
|
been inspected. |
| 159 |
|
</P> |
| 160 |
|
<P> |
| 161 |
|
Setting PCRE_PARTIAL_HARD also affects the way UTF-8 and UTF-16 |
| 162 |
|
subject strings are checked for validity. Normally, an invalid sequence |
| 163 |
|
causes the error PCRE_ERROR_BADUTF8 or PCRE_ERROR_BADUTF16. However, in the |
| 164 |
|
special case of a truncated character at the end of the subject, |
| 165 |
|
PCRE_ERROR_SHORTUTF8 or PCRE_ERROR_SHORTUTF16 is returned when |
| 166 |
|
PCRE_PARTIAL_HARD is set. |
| 167 |
|
</P> |
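<P>
When segments arrive from a byte stream, an application may want to know in
advance whether a segment ends in the middle of a UTF-8 character. The
following is a minimal sketch of such a check, assuming the 8-bit library and
otherwise well-formed input. The function name <b>truncated_tail()</b> is
hypothetical and is not part of PCRE, and the function does not validate the
buffer as a whole.
</P>

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of PCRE. Returns how many bytes at the end
 * of buf form the start of a UTF-8 character whose remaining bytes have not
 * yet arrived, or 0 if the buffer does not end in a truncated character. */
size_t truncated_tail(const unsigned char *buf, size_t len)
{
    assert(buf != NULL || len == 0);
    /* A truncated character has at most 3 of its bytes present. */
    for (size_t back = 1; back <= 3 && back <= len; back++) {
        unsigned char c = buf[len - back];
        if ((c & 0xC0) == 0x80)
            continue;                 /* continuation byte: keep looking */
        /* c is a lead byte (or ASCII): how long should its character be? */
        size_t need = (c >= 0xF0) ? 4 : (c >= 0xE0) ? 3 : (c >= 0xC0) ? 2 : 1;
        return (back < need) ? back : 0;
    }
    return 0;   /* empty buffer, or no lead byte in the last 3 bytes */
}
```

<P>
A caller could hold back the reported number of bytes and prepend them to the
next segment, or simply treat PCRE_ERROR_SHORTUTF8 as a request for more data.
</P>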
| 168 |
|
<br><b> |
| 169 |
|
Comparing hard and soft partial matching |
| 170 |
|
</b><br> |
| 171 |
<P> |
<P> |
| 172 |
If PCRE_PARTIAL_HARD is set for <b>pcre_exec()</b>, it returns |
The difference between the two partial matching options can be illustrated by a |
| 173 |
PCRE_ERROR_PARTIAL as soon as a partial match is found, without continuing to |
pattern such as: |
|
search for possible complete matches. The difference between the two options |
|
|
can be illustrated by a pattern such as: |
|
| 174 |
<pre> |
<pre> |
| 175 |
/dog(sbody)?/ |
/dog(sbody)?/ |
| 176 |
</pre> |
</pre> |
| 182 |
<pre> |
<pre> |
| 183 |
/dog(sbody)??/ |
/dog(sbody)??/ |
| 184 |
</pre> |
</pre> |
| 185 |
In this case the result is always a complete match because <b>pcre_exec()</b> |
In this case the result is always a complete match because that is found first, |
| 186 |
finds that first, and it never continues after finding a match. It might be |
and matching never continues after finding a complete match. It might be easier |
| 187 |
easier to follow this explanation by thinking of the two patterns like this: |
to follow this explanation by thinking of the two patterns like this: |
| 188 |
<pre> |
<pre> |
| 189 |
/dog(sbody)?/ is the same as /dogsbody|dog/ |
/dog(sbody)?/ is the same as /dogsbody|dog/ |
| 190 |
/dog(sbody)??/ is the same as /dog|dogsbody/ |
/dog(sbody)??/ is the same as /dog|dogsbody/ |
| 191 |
</pre> |
</pre> |
| 192 |
The second pattern will never match "dogsbody" when <b>pcre_exec()</b> is |
The second pattern will never match "dogsbody", because it will always find the |
| 193 |
used, because it will always find the shorter match first. |
shorter match first. |
| 194 |
</P> |
</P> |
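<P>
The effect of alternative ordering can be modelled with a toy matcher. This is
only a sketch under strong assumptions: it handles literal alternatives
anchored at the start of the subject, and it is not how PCRE is implemented.
All names (<b>toy_match()</b>, TOY_MATCH, and so on) are hypothetical.
</P>

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum toy_result { TOY_NOMATCH, TOY_PARTIAL, TOY_MATCH };

/* The two patterns from the text, written as ordered alternatives. */
const char *const GREEDY_ALTS[] = { "dogsbody", "dog" };  /* dog(sbody)?  */
const char *const LAZY_ALTS[]   = { "dog", "dogsbody" };  /* dog(sbody)?? */

/* Try each alternative in order against the start of subject.
 * hard != 0 models PCRE_PARTIAL_HARD, hard == 0 models PCRE_PARTIAL_SOFT. */
enum toy_result toy_match(const char *subject,
                          const char *const *alts, size_t n_alts, int hard)
{
    size_t slen = strlen(subject);
    int saw_partial = 0;
    assert(alts != NULL);
    for (size_t i = 0; i < n_alts; i++) {
        size_t alen = strlen(alts[i]);
        size_t k = 0;
        while (k < slen && k < alen && subject[k] == alts[i][k])
            k++;
        if (k == alen)
            return TOY_MATCH;      /* complete match: matching stops here */
        if (k == slen && k > 0) {  /* subject ran out after >= 1 character */
            if (hard)
                return TOY_PARTIAL;    /* hard: report it at once */
            saw_partial = 1;           /* soft: remember it, keep trying */
        }
    }
    return saw_partial ? TOY_PARTIAL : TOY_NOMATCH;
}
```

<P>
In this model, the subject "dog" gives TOY_PARTIAL only for the greedy ordering
with the hard setting; the other three combinations give TOY_MATCH.
</P>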
| 195 |
<br><a name="SEC3" href="#TOC1">PARTIAL MATCHING USING pcre_dfa_exec()</a><br> |
<br><a name="SEC3" href="#TOC1">PARTIAL MATCHING USING pcre_dfa_exec() OR pcre16_dfa_exec()</a><br> |
| 196 |
<P> |
<P> |
| 197 |
The <b>pcre_dfa_exec()</b> function moves along the subject string character by |
The DFA functions move along the subject string character by character, without |
| 198 |
character, without backtracking, searching for all possible matches |
backtracking, searching for all possible matches simultaneously. If the end of |
| 199 |
simultaneously. If the end of the subject is reached before the end of the |
the subject is reached before the end of the pattern, there is the possibility |
| 200 |
pattern, there is the possibility of a partial match, again provided that at |
of a partial match, again provided that at least one character has been |
| 201 |
least one character has matched. |
inspected. |
| 202 |
</P> |
</P> |
| 203 |
<P> |
<P> |
| 204 |
When PCRE_PARTIAL_SOFT is set, PCRE_ERROR_PARTIAL is returned only if there |
When PCRE_PARTIAL_SOFT is set, PCRE_ERROR_PARTIAL is returned only if there |
| 209 |
at least two slots in the offsets vector. |
at least two slots in the offsets vector. |
| 210 |
</P> |
</P> |
| 211 |
<P> |
<P> |
| 212 |
Because <b>pcre_dfa_exec()</b> always searches for all possible matches, and |
Because the DFA functions always search for all possible matches, and there is |
| 213 |
there is no difference between greedy and ungreedy repetition, its behaviour is |
no difference between greedy and ungreedy repetition, their behaviour is |
| 214 |
different from <b>pcre_exec</b> when PCRE_PARTIAL_HARD is set. Consider the |
different from the standard functions when PCRE_PARTIAL_HARD is set. Consider |
| 215 |
string "dog" matched against the ungreedy pattern shown above: |
the string "dog" matched against the ungreedy pattern shown above: |
| 216 |
<pre> |
<pre> |
| 217 |
/dog(sbody)??/ |
/dog(sbody)??/ |
| 218 |
</pre> |
</pre> |
| 219 |
Whereas <b>pcre_exec()</b> stops as soon as it finds the complete match for |
Whereas the standard functions stop as soon as they find the complete match for |
| 220 |
"dog", <b>pcre_dfa_exec()</b> also finds the partial match for "dogsbody", and |
"dog", the DFA functions also find the partial match for "dogsbody", and so |
| 221 |
so returns that when PCRE_PARTIAL_HARD is set. |
return that when PCRE_PARTIAL_HARD is set. |
| 222 |
</P> |
</P> |
| 223 |
<br><a name="SEC4" href="#TOC1">PARTIAL MATCHING AND WORD BOUNDARIES</a><br> |
<br><a name="SEC4" href="#TOC1">PARTIAL MATCHING AND WORD BOUNDARIES</a><br> |
| 224 |
<P> |
<P> |
| 230 |
</pre> |
</pre> |
| 231 |
This matches "cat", provided there is a word boundary at either end. If the |
This matches "cat", provided there is a word boundary at either end. If the |
| 232 |
subject string is "the cat", the comparison of the final "t" with a following |
subject string is "the cat", the comparison of the final "t" with a following |
| 233 |
character cannot take place, so a partial match is found. However, |
character cannot take place, so a partial match is found. However, normal |
| 234 |
<b>pcre_exec()</b> carries on with normal matching, which matches \b at the end |
matching carries on, and \b matches at the end of the subject when the last |
| 235 |
of the subject when the last character is a letter, thus finding a complete |
character is a letter, so a complete match is found. The result, therefore, is |
| 236 |
match. The result, therefore, is <i>not</i> PCRE_ERROR_PARTIAL. The same thing |
<i>not</i> PCRE_ERROR_PARTIAL. Using PCRE_PARTIAL_HARD in this case does yield |
| 237 |
happens with <b>pcre_dfa_exec()</b>, because it also finds the complete match. |
PCRE_ERROR_PARTIAL, because then the partial match takes precedence. |
|
</P> |
|
|
<P> |
|
|
Using PCRE_PARTIAL_HARD in this case does yield PCRE_ERROR_PARTIAL, because |
|
|
then the partial match takes precedence. |
|
| 238 |
</P> |
</P> |
| 239 |
<br><a name="SEC5" href="#TOC1">FORMERLY RESTRICTED PATTERNS</a><br> |
<br><a name="SEC5" href="#TOC1">FORMERLY RESTRICTED PATTERNS</a><br> |
| 240 |
<P> |
<P> |
| 242 |
optimizations were implemented in the <b>pcre_exec()</b> function, the |
optimizations were implemented in the <b>pcre_exec()</b> function, the |
| 243 |
PCRE_PARTIAL option (predecessor of PCRE_PARTIAL_SOFT) could not be used with |
PCRE_PARTIAL option (predecessor of PCRE_PARTIAL_SOFT) could not be used with |
| 244 |
all patterns. From release 8.00 onwards, the restrictions no longer apply, and |
all patterns. From release 8.00 onwards, the restrictions no longer apply, and |
| 245 |
partial matching with <b>pcre_exec()</b> can be requested for any pattern. |
partial matching can be requested for any pattern. |
| 246 |
</P> |
</P> |
| 247 |
<P> |
<P> |
| 248 |
Items that were formerly restricted were repeated single characters and |
Items that were formerly restricted were repeated single characters and |
| 274 |
The first data string is matched completely, so <b>pcretest</b> shows the |
The first data string is matched completely, so <b>pcretest</b> shows the |
| 275 |
matched substrings. The remaining four strings do not match the complete |
matched substrings. The remaining four strings do not match the complete |
| 276 |
pattern, but the first two are partial matches. Similar output is obtained |
pattern, but the first two are partial matches. Similar output is obtained |
| 277 |
when <b>pcre_dfa_exec()</b> is used. |
if DFA matching is used. |
| 278 |
</P> |
</P> |
| 279 |
<P> |
<P> |
| 280 |
If the escape sequence \P is present more than once in a <b>pcretest</b> data |
If the escape sequence \P is present more than once in a <b>pcretest</b> data |
| 281 |
line, the PCRE_PARTIAL_HARD option is set for the match. |
line, the PCRE_PARTIAL_HARD option is set for the match. |
| 282 |
</P> |
</P> |
| 283 |
<br><a name="SEC7" href="#TOC1">MULTI-SEGMENT MATCHING WITH pcre_dfa_exec()</a><br> |
<br><a name="SEC7" href="#TOC1">MULTI-SEGMENT MATCHING WITH pcre_dfa_exec() OR pcre16_dfa_exec()</a><br> |
| 284 |
<P> |
<P> |
| 285 |
When a partial match has been found using <b>pcre_dfa_exec()</b>, it is possible |
When a partial match has been found using a DFA matching function, it is |
| 286 |
to continue the match by providing additional subject data and calling |
possible to continue the match by providing additional subject data and calling |
| 287 |
<b>pcre_dfa_exec()</b> again with the same compiled regular expression, this |
the function again with the same compiled regular expression, this time setting |
| 288 |
time setting the PCRE_DFA_RESTART option. You must pass the same working |
the PCRE_DFA_RESTART option. You must pass the same working space as before, |
| 289 |
space as before, because this is where details of the previous partial match |
because this is where details of the previous partial match are stored. Here is |
| 290 |
are stored. Here is an example using <b>pcretest</b>, using the \R escape |
an example using <b>pcretest</b>, using the \R escape sequence to set the |
| 291 |
sequence to set the PCRE_DFA_RESTART option (\D specifies the use of |
PCRE_DFA_RESTART option (\D specifies the use of the DFA matching function): |
|
<b>pcre_dfa_exec()</b>): |
|
| 292 |
<pre> |
<pre> |
| 293 |
re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/ |
re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/ |
| 294 |
data> 23ja\P\D |
data> 23ja\P\D |
| 305 |
<P> |
<P> |
| 306 |
You can set the PCRE_PARTIAL_SOFT or PCRE_PARTIAL_HARD options with |
You can set the PCRE_PARTIAL_SOFT or PCRE_PARTIAL_HARD options with |
| 307 |
PCRE_DFA_RESTART to continue partial matching over multiple segments. This |
PCRE_DFA_RESTART to continue partial matching over multiple segments. This |
| 308 |
facility can be used to pass very long subject strings to |
facility can be used to pass very long subject strings to the DFA matching |
| 309 |
<b>pcre_dfa_exec()</b>. |
functions. |
| 310 |
|
</P> |
| 311 |
|
<br><a name="SEC8" href="#TOC1">MULTI-SEGMENT MATCHING WITH pcre_exec() OR pcre16_exec()</a><br> |
| 312 |
|
<P> |
| 313 |
|
From release 8.00, the standard matching functions can also be used to do |
| 314 |
|
multi-segment matching. Unlike the DFA functions, it is not possible to |
| 315 |
|
restart the previous match with a new segment of data. Instead, new data must |
| 316 |
|
be added to the previous subject string, and the entire match re-run, starting |
| 317 |
|
from the point where the partial match occurred. Earlier data can be discarded. |
| 318 |
</P> |
</P> |
|
<br><a name="SEC8" href="#TOC1">MULTI-SEGMENT MATCHING WITH pcre_exec()</a><br> |
|
| 319 |
<P> |
<P> |
| 320 |
From release 8.00, <b>pcre_exec()</b> can also be used to do multi-segment |
It is best to use PCRE_PARTIAL_HARD in this situation, because it does not |
| 321 |
matching. Unlike <b>pcre_dfa_exec()</b>, it is not possible to restart the |
treat the end of a segment as the end of the subject when matching \z, \Z, |
| 322 |
previous match with a new segment of data. Instead, new data must be added to |
\b, \B, and $. Consider an unanchored pattern that matches dates: |
|
the previous subject string, and the entire match re-run, starting from the |
|
|
point where the partial match occurred. Earlier data can be discarded. |
|
|
Consider an unanchored pattern that matches dates: |
|
| 323 |
<pre> |
<pre> |
| 324 |
re> /\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d/ |
re> /\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d/ |
| 325 |
data> The date is 23ja\P |
data> The date is 23ja\P\P |
| 326 |
Partial match: 23ja |
Partial match: 23ja |
| 327 |
</pre> |
</pre> |
| 328 |
At this stage, an application could discard the text preceding "23ja", add on |
At this stage, an application could discard the text preceding "23ja", add on |
| 329 |
text from the next segment, and call <b>pcre_exec()</b> again. Unlike |
text from the next segment, and call the matching function again. Unlike the |
| 330 |
<b>pcre_dfa_exec()</b>, the entire matching string must always be available, and |
DFA matching functions, the entire matching string must always be available, |
| 331 |
the complete matching process occurs for each call, so more memory and more |
and the complete matching process occurs for each call, so more memory and more |
| 332 |
processing time is needed. |
processing time are needed. |
| 333 |
</P> |
</P> |
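<P>
The buffer handling described above might be sketched as follows. The helper
<b>carry_over()</b> is hypothetical and is not part of PCRE; it assumes that
<i>keep_from</i> is the first offset returned for the partial match, and that
the new segment does not overlap the buffer.
</P>

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of PCRE. Discards buf[0 .. keep_from),
 * moves the retained tail to the front, and appends the next segment.
 * On success returns 1 and updates *len; returns 0 (buffer unchanged)
 * if the result would not fit in cap bytes. */
int carry_over(char *buf, size_t *len, size_t cap, size_t keep_from,
               const char *segment, size_t seg_len)
{
    assert(buf != NULL && len != NULL && segment != NULL);
    assert(keep_from <= *len && *len <= cap);
    size_t kept = *len - keep_from;
    if (seg_len > cap - kept)
        return 0;                          /* caller must grow the buffer */
    memmove(buf, buf + keep_from, kept);   /* source and target may overlap */
    memcpy(buf + kept, segment, seg_len);
    *len = kept + seg_len;
    return 1;
}
```

<P>
For the example above, calling this with <i>keep_from</i> set to 12 on "The
date is 23ja" leaves "23ja" at the start of the buffer, followed by the new
segment, ready for the matching function to be called again.
</P>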
| 334 |
<P> |
<P> |
| 335 |
<b>Note:</b> If the pattern contains lookbehind assertions, or \K, or starts |
<b>Note:</b> If the pattern contains lookbehind assertions, or \K, or starts |
| 336 |
with \b or \B, the string that is returned for a partial match will include |
with \b or \B, the string that is returned for a partial match includes |
| 337 |
characters that precede the partially matched string itself, because these must |
characters that precede the partially matched string itself, because these must |
| 338 |
be retained when adding on more characters for a subsequent matching attempt. |
be retained when adding on more characters for a subsequent matching attempt. |
| 339 |
|
However, in some cases you may need to retain even earlier characters, as |
| 340 |
|
discussed in the next section. |
| 341 |
</P> |
</P> |
| 342 |
<br><a name="SEC9" href="#TOC1">ISSUES WITH MULTI-SEGMENT MATCHING</a><br> |
<br><a name="SEC9" href="#TOC1">ISSUES WITH MULTI-SEGMENT MATCHING</a><br> |
| 343 |
<P> |
<P> |
| 345 |
whichever matching function is used. |
whichever matching function is used. |
| 346 |
</P> |
</P> |
| 347 |
<P> |
<P> |
| 348 |
1. If the pattern contains tests for the beginning or end of a line, you need |
1. If the pattern contains a test for the beginning of a line, you need to pass |
| 349 |
to pass the PCRE_NOTBOL or PCRE_NOTEOL options, as appropriate, when the |
the PCRE_NOTBOL option when the subject string for any call does not start at the |
| 350 |
subject string for any call does not contain the beginning or end of a line. |
beginning of a line. There is also a PCRE_NOTEOL option, but in practice when |
| 351 |
</P> |
doing multi-segment matching you should be using PCRE_PARTIAL_HARD, which |
| 352 |
<P> |
includes the effect of PCRE_NOTEOL. |
| 353 |
2. Lookbehind assertions at the start of a pattern are catered for in the |
</P> |
| 354 |
offsets that are returned for a partial match. However, in theory, a lookbehind |
<P> |
| 355 |
assertion later in the pattern could require even earlier characters to be |
2. Lookbehind assertions that have already been obeyed are catered for in the |
| 356 |
inspected, and it might not have been reached when a partial match occurs. This |
offsets that are returned for a partial match. However a lookbehind assertion |
| 357 |
is probably an extremely unlikely case; you could guard against it to a certain |
later in the pattern could require even earlier characters to be inspected. You |
| 358 |
extent by always including extra characters at the start. |
can handle this case by using the PCRE_INFO_MAXLOOKBEHIND option of the |
| 359 |
|
<b>pcre_fullinfo()</b> or <b>pcre16_fullinfo()</b> functions to obtain the length |
| 360 |
|
of the largest lookbehind in the pattern. This length is given in characters, |
| 361 |
|
not bytes. If you always retain at least that many characters before the |
| 362 |
|
partially matched string, all should be well. (Of course, near the start of the |
| 363 |
|
subject, fewer characters may be present; in that case all characters should be |
| 364 |
|
retained.) |
| 365 |
|
</P> |
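<P>
Because the lookbehind length is given in characters, a UTF-8 application has
to convert it to a byte offset before discarding data. The following sketch
assumes the 8-bit library in UTF-8 mode, a valid UTF-8 buffer, and a
<i>max_lookbehind</i> value already obtained via PCRE_INFO_MAXLOOKBEHIND. The
function name <b>retain_from()</b> is hypothetical and is not part of PCRE.
</P>

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of PCRE. Returns the byte offset that lies
 * max_lookbehind characters before partial_start in a valid UTF-8 buffer,
 * or 0 if fewer characters than that are available. Everything before the
 * returned offset can be discarded. */
size_t retain_from(const unsigned char *buf, size_t partial_start,
                   size_t max_lookbehind)
{
    size_t pos = partial_start;
    assert(buf != NULL);
    while (max_lookbehind > 0 && pos > 0) {
        pos--;                                  /* step onto previous byte */
        while (pos > 0 && (buf[pos] & 0xC0) == 0x80)
            pos--;                              /* skip continuation bytes */
        max_lookbehind--;
    }
    return pos;
}
```

<P>
In non-UTF mode every character is a single data unit, so the same result is
obtained by a plain subtraction clamped at zero.
</P>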
| 366 |
|
<P> |
| 367 |
|
3. Because a partial match must always contain at least one character, what |
| 368 |
|
might be considered a partial match of an empty string actually gives a "no |
| 369 |
|
match" result. For example: |
| 370 |
|
<pre> |
| 371 |
|
re> /c(?<=abc)x/ |
| 372 |
|
data> ab\P |
| 373 |
|
No match |
| 374 |
|
</pre> |
| 375 |
|
If the next segment begins "cx", a match should be found, but this will only |
| 376 |
|
happen if characters from the previous segment are retained. For this reason, a |
| 377 |
|
"no match" result should be interpreted as "partial match of an empty string" |
| 378 |
|
when the pattern contains lookbehinds. |
| 379 |
</P> |
</P> |
| 380 |
<P> |
<P> |
| 381 |
3. Matching a subject string that is split into multiple segments may not |
4. Matching a subject string that is split into multiple segments may not |
| 382 |
always produce exactly the same result as matching over one single long string, |
always produce exactly the same result as matching over one single long string, |
| 383 |
especially when PCRE_PARTIAL_SOFT is used. The section "Partial Matching and |
especially when PCRE_PARTIAL_SOFT is used. The section "Partial Matching and |
| 384 |
Word Boundaries" above describes an issue that arises if the pattern ends with |
Word Boundaries" above describes an issue that arises if the pattern ends with |
| 385 |
\b or \B. Another kind of difference may occur when there are multiple |
\b or \B. Another kind of difference may occur when there are multiple |
| 386 |
matching possibilities, because a partial match result is given only when there |
matching possibilities, because (for PCRE_PARTIAL_SOFT) a partial match result |
| 387 |
are no completed matches. This means that as soon as the shortest match has |
is given only when there are no completed matches. This means that as soon as |
| 388 |
been found, continuation to a new subject segment is no longer possible. |
the shortest match has been found, continuation to a new subject segment is no |
| 389 |
Consider again this <b>pcretest</b> example: |
longer possible. Consider again this <b>pcretest</b> example: |
| 390 |
<pre> |
<pre> |
| 391 |
re> /dog(sbody)?/ |
re> /dog(sbody)?/ |
| 392 |
data> dogsb\P |
data> dogsb\P |
| 399 |
0: dogsbody |
0: dogsbody |
| 400 |
1: dog |
1: dog |
| 401 |
</pre> |
</pre> |
| 402 |
The first data line passes the string "dogsb" to <b>pcre_exec()</b>, setting the |
The first data line passes the string "dogsb" to a standard matching function, |
| 403 |
PCRE_PARTIAL_SOFT option. Although the string is a partial match for |
setting the PCRE_PARTIAL_SOFT option. Although the string is a partial match |
| 404 |
"dogsbody", the result is not PCRE_ERROR_PARTIAL, because the shorter string |
for "dogsbody", the result is not PCRE_ERROR_PARTIAL, because the shorter |
| 405 |
"dog" is a complete match. Similarly, when the subject is presented to |
string "dog" is a complete match. Similarly, when the subject is presented to |
| 406 |
<b>pcre_dfa_exec()</b> in several parts ("do" and "gsb" being the first two) the |
a DFA matching function in several parts ("do" and "gsb" being the first two) |
| 407 |
match stops when "dog" has been found, and it is not possible to continue. On |
the match stops when "dog" has been found, and it is not possible to continue. |
| 408 |
the other hand, if "dogsbody" is presented as a single string, |
On the other hand, if "dogsbody" is presented as a single string, a DFA |
| 409 |
<b>pcre_dfa_exec()</b> finds both matches. |
matching function finds both matches. |
| 410 |
</P> |
</P> |
| 411 |
<P> |
<P> |
| 412 |
Because of these problems, it is probably best to use PCRE_PARTIAL_HARD when |
Because of these problems, it is best to use PCRE_PARTIAL_HARD when matching |
| 413 |
matching multi-segment data. The example above then behaves differently: |
multi-segment data. The example above then behaves differently: |
| 414 |
<pre> |
<pre> |
| 415 |
re> /dog(sbody)?/ |
re> /dog(sbody)?/ |
| 416 |
data> dogsb\P\P |
data> dogsb\P\P |
| 419 |
Partial match: do |
Partial match: do |
| 420 |
data> gsb\R\P\P\D |
data> gsb\R\P\P\D |
| 421 |
Partial match: gsb |
Partial match: gsb |
| 422 |
|
</pre> |
| 423 |
</PRE> |
5. Patterns that contain alternatives at the top level which do not all start |
| 424 |
</P> |
with the same pattern item may not work as expected when PCRE_DFA_RESTART is |
| 425 |
<P> |
used. For example, consider this pattern: |
|
4. Patterns that contain alternatives at the top level which do not all |
|
|
start with the same pattern item may not work as expected when |
|
|
PCRE_DFA_RESTART is used with <b>pcre_dfa_exec()</b>. For example, consider this |
|
|
pattern: |
|
| 426 |
<pre> |
<pre> |
| 427 |
1234|3789 |
1234|3789 |
| 428 |
</pre> |
</pre> |
| 438 |
1234|ABCD |
1234|ABCD |
| 439 |
</pre> |
</pre> |
| 440 |
where no string can be a partial match for both alternatives. This is not a |
where no string can be a partial match for both alternatives. This is not a |
| 441 |
problem if <b>pcre_exec()</b> is used, because the entire match has to be rerun |
problem if a standard matching function is used, because the entire match has |
| 442 |
each time: |
to be rerun each time: |
| 443 |
<pre> |
<pre> |
| 444 |
re> /1234|3789/ |
re> /1234|3789/ |
| 445 |
data> ABC123\P |
data> ABC123\P\P |
| 446 |
Partial match: 123 |
Partial match: 123 |
| 447 |
data> 1237890 |
data> 1237890 |
| 448 |
0: 3789 |
0: 3789 |
| 449 |
</pre> |
</pre> |
| 450 |
Of course, instead of using PCRE_DFA_PARTIAL, the same technique of re-running |
Of course, instead of using PCRE_DFA_RESTART, the same technique of re-running |
| 451 |
the entire match can also be used with <b>pcre_dfa_exec()</b>. Another |
the entire match can also be used with the DFA matching functions. Another |
| 452 |
possibility is to work with two buffers. If a partial match at offset <i>n</i> |
possibility is to work with two buffers. If a partial match at offset <i>n</i> |
| 453 |
in the first buffer is followed by "no match" when PCRE_DFA_RESTART is used on |
in the first buffer is followed by "no match" when PCRE_DFA_RESTART is used on |
| 454 |
the second buffer, you can then try a new match starting at offset <i>n+1</i> in |
the second buffer, you can then try a new match starting at offset <i>n+1</i> in |
| 465 |
</P> |
</P> |
| 466 |
<br><a name="SEC11" href="#TOC1">REVISION</a><br> |
<br><a name="SEC11" href="#TOC1">REVISION</a><br> |
| 467 |
<P> |
<P> |
| 468 |
Last updated: 19 October 2009 |
Last updated: 24 February 2012 |
| 469 |
<br> |
<br> |
| 470 |
Copyright © 1997-2009 University of Cambridge. |
Copyright © 1997-2012 University of Cambridge. |
| 471 |
<br> |
<br> |
| 472 |
<p> |
<p> |
| 473 |
Return to the <a href="index.html">PCRE index page</a>. |
Return to the <a href="index.html">PCRE index page</a>. |