Diff of /code/trunk/doc/html/pcrepartial.html

revision 566 by ph10, Mon Oct 19 14:38:48 2009 UTC revision 567 by ph10, Sat Nov 6 17:10:00 2010 UTC
# Line 45  what has been typed so far is potentiall Line 45  what has been typed so far is potentiall
45  as soon as a mistake is made, by beeping and not reflecting the character that  as soon as a mistake is made, by beeping and not reflecting the character that
46  has been typed, for example. This immediate feedback is likely to be a better  has been typed, for example. This immediate feedback is likely to be a better
47  user interface than a check that is delayed until the entire string has been  user interface than a check that is delayed until the entire string has been
48  entered. Partial matching can also sometimes be useful when the subject string  entered. Partial matching can also be useful when the subject string is very
49  is very long and is not all available at once.  long and is not all available at once.
50  </P>  </P>
51  <P>  <P>
52  PCRE supports partial matching by means of the PCRE_PARTIAL_SOFT and  PCRE supports partial matching by means of the PCRE_PARTIAL_SOFT and
# Line 68  also disabled for partial matching. Line 68  also disabled for partial matching.
68  </P>  </P>
69  <br><a name="SEC2" href="#TOC1">PARTIAL MATCHING USING pcre_exec()</a><br>  <br><a name="SEC2" href="#TOC1">PARTIAL MATCHING USING pcre_exec()</a><br>
70  <P>  <P>
71  A partial match occurs during a call to <b>pcre_exec()</b> whenever the end of  A partial match occurs during a call to <b>pcre_exec()</b> when the end of the
72  the subject string is reached successfully, but matching cannot continue  subject string is reached successfully, but matching cannot continue because
73  because more characters are needed. However, at least one character must have  more characters are needed. However, at least one character in the subject must
74  been matched. (In other words, a partial match can never be an empty string.)  have been inspected. This character need not form part of the final matched
75  </P>  string; lookbehind assertions and the \K escape sequence provide ways of
76  <P>  inspecting characters before the start of a matched substring. The requirement
77  If PCRE_PARTIAL_SOFT is set, the partial match is remembered, but matching  for inspecting at least one character exists because an empty string can always
78  continues as normal, and other alternatives in the pattern are tried. If no  be matched; without such a restriction there would always be a partial match of
79  complete match can be found, <b>pcre_exec()</b> returns PCRE_ERROR_PARTIAL  an empty string at the end of the subject.
80  instead of PCRE_ERROR_NOMATCH. If there are at least two slots in the offsets  </P>
81  vector, the first of them is set to the offset of the earliest character that  <P>
82  was inspected when the partial match was found. For convenience, the second  If there are at least two slots in the offsets vector when <b>pcre_exec()</b>
83  offset points to the end of the string so that a substring can easily be  returns with a partial match, the first slot is set to the offset of the
84  identified.  earliest character that was inspected when the partial match was found. For
85    convenience, the second offset points to the end of the subject so that a
86    substring can easily be identified.
87  </P>  </P>
88  <P>  <P>
89  For the majority of patterns, the first offset identifies the start of the  For the majority of patterns, the first offset identifies the start of the
# Line 94  inspected while carrying out the match. Line 96  inspected while carrying out the match.
96  This pattern matches "123", but only if it is preceded by "abc". If the subject  This pattern matches "123", but only if it is preceded by "abc". If the subject
97  string is "xyzabc12", the offsets after a partial match are for the substring  string is "xyzabc12", the offsets after a partial match are for the substring
98  "abc12", because all these characters are needed if another match is tried  "abc12", because all these characters are needed if another match is tried
99  with extra characters added.  with extra characters added to the subject.
100    </P>
101    <P>
102    What happens when a partial match is identified depends on which of the two
103    partial matching options is set.
104    </P>
105    <br><b>
106    PCRE_PARTIAL_SOFT with pcre_exec()
107    </b><br>
108    <P>
109    If PCRE_PARTIAL_SOFT is set when <b>pcre_exec()</b> identifies a partial match,
110    the partial match is remembered, but matching continues as normal, and other
111    alternatives in the pattern are tried. If no complete match can be found,
112    <b>pcre_exec()</b> returns PCRE_ERROR_PARTIAL instead of PCRE_ERROR_NOMATCH.
113    </P>
114    <P>
115    This option is "soft" because it prefers a complete match over a partial match.
116    All the various matching items in a pattern behave as if the subject string is
117    potentially complete. For example, \z, \Z, and $ match at the end of the
118    subject, as normal, and for \b and \B the end of the subject is treated as a
119    non-alphanumeric.
120  </P>  </P>
121  <P>  <P>
122  If there is more than one partial match, the first one that was found provides  If there is more than one partial match, the first one that was found provides
# Line 104  the data that is returned. Consider this Line 126  the data that is returned. Consider this
126  </pre>  </pre>
127  If this is matched against the subject string "abc123dog", both  If this is matched against the subject string "abc123dog", both
128  alternatives fail to match, but the end of the subject is reached during  alternatives fail to match, but the end of the subject is reached during
129  matching, so PCRE_ERROR_PARTIAL is returned instead of PCRE_ERROR_NOMATCH. The  matching, so PCRE_ERROR_PARTIAL is returned. The offsets are set to 3 and 9,
130  offsets are set to 3 and 9, identifying "123dog" as the first partial match  identifying "123dog" as the first partial match that was found. (In this
131  that was found. (In this example, there are two partial matches, because "dog"  example, there are two partial matches, because "dog" on its own partially
132  on its own partially matches the second alternative.)  matches the second alternative.)
133  </P>  </P>
134    <br><b>
135    PCRE_PARTIAL_HARD with pcre_exec()
136    </b><br>
137  <P>  <P>
138  If PCRE_PARTIAL_HARD is set for <b>pcre_exec()</b>, it returns  If PCRE_PARTIAL_HARD is set for <b>pcre_exec()</b>, it returns
139  PCRE_ERROR_PARTIAL as soon as a partial match is found, without continuing to  PCRE_ERROR_PARTIAL as soon as a partial match is found, without continuing to
140  search for possible complete matches. The difference between the two options  search for possible complete matches. This option is "hard" because it prefers
141  can be illustrated by a pattern such as:  an earlier partial match over a later complete match. For this reason, the
142    assumption is made that the end of the supplied subject string may not be the
143    true end of the available data, and so, if \z, \Z, \b, \B, or $ are
144    encountered at the end of the subject, the result is PCRE_ERROR_PARTIAL.
145    </P>
146    <br><b>
147    Comparing hard and soft partial matching
148    </b><br>
149    <P>
150    The difference between the two partial matching options can be illustrated by a
151    pattern such as:
152  <pre>  <pre>
153    /dog(sbody)?/    /dog(sbody)?/
154  </pre>  </pre>
# Line 141  The pcre_dfa_exec() function move Line 176  The pcre_dfa_exec() function move
176  character, without backtracking, searching for all possible matches  character, without backtracking, searching for all possible matches
177  simultaneously. If the end of the subject is reached before the end of the  simultaneously. If the end of the subject is reached before the end of the
178  pattern, there is the possibility of a partial match, again provided that at  pattern, there is the possibility of a partial match, again provided that at
179  least one character has matched.  least one character has been inspected.
180  </P>  </P>
181  <P>  <P>
182  When PCRE_PARTIAL_SOFT is set, PCRE_ERROR_PARTIAL is returned only if there  When PCRE_PARTIAL_SOFT is set, PCRE_ERROR_PARTIAL is returned only if there
# Line 262  From release 8.00, pcre_exec() ca Line 297  From release 8.00, pcre_exec() ca
297  matching. Unlike <b>pcre_dfa_exec()</b>, it is not possible to restart the  matching. Unlike <b>pcre_dfa_exec()</b>, it is not possible to restart the
298  previous match with a new segment of data. Instead, new data must be added to  previous match with a new segment of data. Instead, new data must be added to
299  the previous subject string, and the entire match re-run, starting from the  the previous subject string, and the entire match re-run, starting from the
300  point where the partial match occurred. Earlier data can be discarded.  point where the partial match occurred. Earlier data can be discarded. It is
301  Consider an unanchored pattern that matches dates:  best to use PCRE_PARTIAL_HARD in this situation, because it does not treat the
302    end of a segment as the end of the subject when matching \z, \Z, \b, \B,
303    and $. Consider an unanchored pattern that matches dates:
304  <pre>  <pre>
305      re&#62; /\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d/      re&#62; /\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d/
306    data&#62; The date is 23ja\P    data&#62; The date is 23ja\P\P
307    Partial match: 23ja    Partial match: 23ja
308  </pre>  </pre>
309  At this stage, an application could discard the text preceding "23ja", add on  At this stage, an application could discard the text preceding "23ja", add on
# Line 287  Certain types of pattern may give proble Line 324  Certain types of pattern may give proble
324  whichever matching function is used.  whichever matching function is used.
325  </P>  </P>
326  <P>  <P>
327  1. If the pattern contains tests for the beginning or end of a line, you need  1. If the pattern contains a test for the beginning of a line, you need to pass
328  to pass the PCRE_NOTBOL or PCRE_NOTEOL options, as appropriate, when the  the PCRE_NOTBOL option when the subject string for any call does not start at the
329  subject string for any call does not contain the beginning or end of a line.  beginning of a line. There is also a PCRE_NOTEOL option, but in practice when
330    doing multi-segment matching you should be using PCRE_PARTIAL_HARD, which
331    includes the effect of PCRE_NOTEOL.
332  </P>  </P>
333  <P>  <P>
334  2. Lookbehind assertions at the start of a pattern are catered for in the  2. Lookbehind assertions at the start of a pattern are catered for in the
# Line 305  always produce exactly the same result a Line 344  always produce exactly the same result a
344  especially when PCRE_PARTIAL_SOFT is used. The section "Partial Matching and  especially when PCRE_PARTIAL_SOFT is used. The section "Partial Matching and
345  Word Boundaries" above describes an issue that arises if the pattern ends with  Word Boundaries" above describes an issue that arises if the pattern ends with
346  \b or \B. Another kind of difference may occur when there are multiple  \b or \B. Another kind of difference may occur when there are multiple
347  matching possibilities, because a partial match result is given only when there  matching possibilities, because (for PCRE_PARTIAL_SOFT) a partial match result
348  are no completed matches. This means that as soon as the shortest match has  is given only when there are no completed matches. This means that as soon as
349  been found, continuation to a new subject segment is no longer possible.  the shortest match has been found, continuation to a new subject segment is no
350  Consider again this <b>pcretest</b> example:  longer possible. Consider again this <b>pcretest</b> example:
351  <pre>  <pre>
352      re&#62; /dog(sbody)?/      re&#62; /dog(sbody)?/
353    data&#62; dogsb\P    data&#62; dogsb\P
# Line 331  the other hand, if "dogsbody" is present Line 370  the other hand, if "dogsbody" is present
370  <b>pcre_dfa_exec()</b> finds both matches.  <b>pcre_dfa_exec()</b> finds both matches.
371  </P>  </P>
372  <P>  <P>
373  Because of these problems, it is probably best to use PCRE_PARTIAL_HARD when  Because of these problems, it is best to use PCRE_PARTIAL_HARD when matching
374  matching multi-segment data. The example above then behaves differently:  multi-segment data. The example above then behaves differently:
375  <pre>  <pre>
376      re&#62; /dog(sbody)?/      re&#62; /dog(sbody)?/
377    data&#62; dogsb\P\P    data&#62; dogsb\P\P
# Line 368  problem if pcre_exec() is used, b Line 407  problem if pcre_exec() is used, b
407  each time:  each time:
408  <pre>  <pre>
409      re&#62; /1234|3789/      re&#62; /1234|3789/
410    data&#62; ABC123\P    data&#62; ABC123\P\P
411    Partial match: 123    Partial match: 123
412    data&#62; 1237890    data&#62; 1237890
413     0: 3789     0: 3789
414  </pre>  </pre>
415  Of course, instead of using PCRE_DFA_PARTIAL, the same technique of re-running  Of course, instead of using PCRE_DFA_RESTART, the same technique of re-running
416  the entire match can also be used with <b>pcre_dfa_exec()</b>. Another  the entire match can also be used with <b>pcre_dfa_exec()</b>. Another
417  possibility is to work with two buffers. If a partial match at offset <i>n</i>  possibility is to work with two buffers. If a partial match at offset <i>n</i>
418  in the first buffer is followed by "no match" when PCRE_DFA_RESTART is used on  in the first buffer is followed by "no match" when PCRE_DFA_RESTART is used on
# Line 391  Cambridge CB2 3QH, England. Line 430  Cambridge CB2 3QH, England.
430  </P>  </P>
431  <br><a name="SEC11" href="#TOC1">REVISION</a><br>  <br><a name="SEC11" href="#TOC1">REVISION</a><br>
432  <P>  <P>
433  Last updated: 19 October 2009  Last updated: 22 October 2010
434  <br>  <br>
435  Copyright &copy; 1997-2009 University of Cambridge.  Copyright &copy; 1997-2010 University of Cambridge.
436  <br>  <br>
437  <p>  <p>
438  Return to the <a href="index.html">PCRE index page</a>.  Return to the <a href="index.html">PCRE index page</a>.
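
The hunk at line 297 describes multi-segment matching with pcre_exec(): after a partial match, text before the partial match can be discarded, new data is appended, and the whole match is re-run. The buffer handling for that step might look like the minimal sketch below. It does not call PCRE. The helper name carry_over and its parameters are hypothetical, and keep_from is assumed to be the first offset that pcre_exec() placed in the offsets vector when it returned PCRE_ERROR_PARTIAL.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper, not part of the PCRE API.
 *
 * After a partial match whose first offset is keep_from, discard
 * buf[0 .. keep_from), slide the retained tail to the front of the buffer,
 * and append the next segment of input.
 *
 * Returns the new subject length, or (size_t)-1 if the retained tail plus
 * the new segment (plus a terminating NUL) would not fit in cap bytes.
 */
static size_t carry_over(char *buf, size_t len, size_t cap,
                         size_t keep_from,
                         const char *seg, size_t seg_len)
{
    assert(keep_from <= len);

    size_t tail = len - keep_from;      /* bytes kept from the old subject */
    if (tail + seg_len + 1 > cap)
        return (size_t)-1;              /* caller must grow the buffer */

    memmove(buf, buf + keep_from, tail);   /* regions may overlap */
    memcpy(buf + tail, seg, seg_len);
    buf[tail + seg_len] = '\0';
    return tail + seg_len;
}
```

With the date example from the diff, the subject "The date is 23ja" yields a partial match starting at offset 12. Calling carry_over with keep_from set to 12 and a next segment of "n09 was" leaves "23jan09 was" in the buffer, which can then be passed to pcre_exec() again. Because the first offset is that of the earliest character inspected, any text consumed by a lookbehind assertion is retained as well.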