Revision 461, Mon Oct 5 10:59:35 2009 UTC, by ph10
Tidy up, remove trailing spaces, etc. for 8.00-RC1.

<html>
<head>
<title>pcrepartial specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcrepartial man page</h1>
<p>
Return to the <a href="index.html">PCRE index page</a>.
</p>
<p>
This page is part of the PCRE HTML documentation. It was generated automatically
from the original man page. If there is any nonsense in it, please consult the
man page, in case the conversion went wrong.
<br>
<ul>
<li><a name="TOC1" href="#SEC1">PARTIAL MATCHING IN PCRE</a>
<li><a name="TOC2" href="#SEC2">PARTIAL MATCHING USING pcre_exec()</a>
<li><a name="TOC3" href="#SEC3">PARTIAL MATCHING USING pcre_dfa_exec()</a>
<li><a name="TOC4" href="#SEC4">PARTIAL MATCHING AND WORD BOUNDARIES</a>
<li><a name="TOC5" href="#SEC5">FORMERLY RESTRICTED PATTERNS</a>
<li><a name="TOC6" href="#SEC6">EXAMPLE OF PARTIAL MATCHING USING PCRETEST</a>
<li><a name="TOC7" href="#SEC7">MULTI-SEGMENT MATCHING WITH pcre_dfa_exec()</a>
<li><a name="TOC8" href="#SEC8">MULTI-SEGMENT MATCHING WITH pcre_exec()</a>
<li><a name="TOC9" href="#SEC9">ISSUES WITH MULTI-SEGMENT MATCHING</a>
<li><a name="TOC10" href="#SEC10">AUTHOR</a>
<li><a name="TOC11" href="#SEC11">REVISION</a>
</ul>
<br><a name="SEC1" href="#TOC1">PARTIAL MATCHING IN PCRE</a><br>
<P>
In normal use of PCRE, if the subject string that is passed to
<b>pcre_exec()</b> or <b>pcre_dfa_exec()</b> matches as far as it goes, but is
too short to match the entire pattern, PCRE_ERROR_NOMATCH is returned. There
are circumstances where it might be helpful to distinguish this case from other
cases in which there is no match.
</P>
<P>
Consider, for example, an application where a human is required to type in data
for a field with specific formatting requirements. An example might be a date
in the form <i>ddmmmyy</i>, defined by this pattern:
<pre>
  ^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$
</pre>
If the application sees the user's keystrokes one by one, and can check that
what has been typed so far is potentially valid, it is able to raise an error
as soon as a mistake is made, by beeping and not reflecting the character that
has been typed, for example. This immediate feedback is likely to be a better
user interface than a check that is delayed until the entire string has been
entered. Partial matching can also sometimes be useful when the subject string
is very long and is not all available at once.
</P>
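<P>
The three outcomes that a keystroke-by-keystroke check has to distinguish
(complete match, partial match, no match) can be sketched for the date pattern
above. This is a hand-written model in Python, not PCRE itself: Python's
standard <b>re</b> module has no partial-matching option, so the prefix logic
is spelled out explicitly. The name <b>classify</b> is hypothetical and the
logic applies to this one pattern only.
</P>

```python
import re

MONTHS = ("jan", "feb", "mar", "apr", "may", "jun",
          "jul", "aug", "sep", "oct", "nov", "dec")
FULL = re.compile(r"\d?\d(" + "|".join(MONTHS) + r")\d\d")


def classify(text):
    """Return 'complete', 'partial' or 'nomatch' for the ddmmmyy pattern.

    Hypothetical helper: it hard-codes the structure of this one pattern
    and is not how PCRE finds partial matches internally.
    """
    if FULL.fullmatch(text):
        return "complete"
    day = re.match(r"\d{1,2}", text)
    if day is None:
        return "nomatch"  # also covers the empty string
    rest = text[day.end():]
    if len(rest) < 3:
        # Still typing the month: accept any prefix of a month name.
        ok = any(month.startswith(rest) for month in MONTHS)
    else:
        # Month is complete: at most one year digit may follow here,
        # because two digits would have been a complete match above.
        ok = rest[:3] in MONTHS and re.fullmatch(r"\d?", rest[3:]) is not None
    return "partial" if ok else "nomatch"
```

<P>
An input handler built on such a function would accept a keystroke while the
result is "partial" or "complete" and reject it as soon as the result becomes
"nomatch".
</P>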
<P>
PCRE supports partial matching by means of the PCRE_PARTIAL_SOFT and
PCRE_PARTIAL_HARD options, which can be set when calling <b>pcre_exec()</b> or
<b>pcre_dfa_exec()</b>. For backwards compatibility, PCRE_PARTIAL is a synonym
for PCRE_PARTIAL_SOFT. The essential difference between the two options is
whether or not a partial match is preferred to an alternative complete match,
though the details differ between the two matching functions. If both options
are set, PCRE_PARTIAL_HARD takes precedence.
</P>
<P>
Setting a partial matching option disables two of PCRE's optimizations. PCRE
remembers the last literal byte in a pattern, and abandons matching immediately
if such a byte is not present in the subject string. This optimization cannot
be used for a subject string that might match only partially. If the pattern
was studied, PCRE knows the minimum length of a matching string, and does not
bother to run the matching function on shorter strings. This optimization is
also disabled for partial matching.
</P>
<br><a name="SEC2" href="#TOC1">PARTIAL MATCHING USING pcre_exec()</a><br>
<P>
A partial match occurs during a call to <b>pcre_exec()</b> whenever the end of
the subject string is reached successfully, but matching cannot continue
because more characters are needed. However, at least one character must have
been matched. (In other words, a partial match can never be an empty string.)
</P>
<P>
If PCRE_PARTIAL_SOFT is set, the partial match is remembered, but matching
continues as normal, and other alternatives in the pattern are tried. If no
complete match can be found, <b>pcre_exec()</b> returns PCRE_ERROR_PARTIAL
instead of PCRE_ERROR_NOMATCH. If there are at least two slots in the offsets
vector, the first of them is set to the offset of the earliest character that
was inspected when the partial match was found. For convenience, the second
offset points to the end of the string so that a substring can easily be
identified.
</P>
<P>
For the majority of patterns, the first offset identifies the start of the
partially matched string. However, for patterns that contain lookbehind
assertions, or \K, or begin with \b or \B, earlier characters have been
inspected while carrying out the match. For example:
<pre>
  /(?&#60;=abc)123/
</pre>
This pattern matches "123", but only if it is preceded by "abc". If the subject
string is "xyzabc12", the offsets after a partial match are for the substring
"abc12", because all these characters are needed if another match is tried
with extra characters added.
</P>
<P>
If there is more than one partial match, the first one that was found provides
the data that is returned. Consider this pattern:
<pre>
  /123\w+X|dogY/
</pre>
If this is matched against the subject string "abc123dog", both
alternatives fail to match, but the end of the subject is reached during
matching, so PCRE_ERROR_PARTIAL is returned instead of PCRE_ERROR_NOMATCH. The
offsets are set to 3 and 9, identifying "123dog" as the first partial match
that was found. (In this example, there are two partial matches, because "dog"
on its own partially matches the second alternative.)
</P>
<P>
If PCRE_PARTIAL_HARD is set for <b>pcre_exec()</b>, it returns
PCRE_ERROR_PARTIAL as soon as a partial match is found, without continuing to
search for possible complete matches. The difference between the two options
can be illustrated by a pattern such as:
<pre>
  /dog(sbody)?/
</pre>
This matches either "dog" or "dogsbody", greedily (that is, it prefers the
longer string if possible). If it is matched against the string "dog" with
PCRE_PARTIAL_SOFT, it yields a complete match for "dog". However, if
PCRE_PARTIAL_HARD is set, the result is PCRE_ERROR_PARTIAL. On the other hand,
if the pattern is made ungreedy the result is different:
<pre>
  /dog(sbody)??/
</pre>
In this case the result is always a complete match because <b>pcre_exec()</b>
finds that first, and it never continues after finding a match. It might be
easier to follow this explanation by thinking of the two patterns like this:
<pre>
  /dog(sbody)?/ is the same as /dogsbody|dog/
  /dog(sbody)??/ is the same as /dog|dogsbody/
</pre>
The second pattern will never match "dogsbody" when <b>pcre_exec()</b> is
used, because it will always find the shorter match first.
</P>
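<P>
The soft/hard distinction described above can be modelled with a toy matcher
that treats a pattern as an ordered list of literal alternatives, in the same
way that /dog(sbody)?/ is rewritten as /dogsbody|dog/. This is a simplified
sketch in Python, not PCRE's implementation; the function name
<b>exec_model</b> and its return tuples are assumptions made for illustration,
and only non-empty literal alternatives are supported.
</P>

```python
def exec_model(alternatives, subject, hard=False):
    """Toy model of ordered-alternative matching with partial results.

    alternatives: non-empty literal strings, tried in order.
    Returns (status, start, end), where status is 'match', 'partial'
    or 'nomatch'. Hypothetical helper, not part of any PCRE binding.
    """
    first_partial = None
    for start in range(len(subject)):
        rest = subject[start:]
        for alt in alternatives:
            if rest.startswith(alt):
                # A complete match ends the search immediately.
                return ("match", start, start + len(alt))
            if alt.startswith(rest):
                # The subject ran out while this alternative still matched.
                if hard:
                    return ("partial", start, len(subject))
                if first_partial is None:
                    # Soft: remember the first partial match, keep trying.
                    first_partial = ("partial", start, len(subject))
    return first_partial if first_partial else ("nomatch", -1, -1)
```

<P>
With the alternatives in greedy order, <b>exec_model(["dogsbody", "dog"],
"dog")</b> models the soft case and reports a complete match, while passing
<b>hard=True</b> reports a partial match. With the ungreedy order
<b>["dog", "dogsbody"]</b>, both settings report a complete match.
</P>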
<br><a name="SEC3" href="#TOC1">PARTIAL MATCHING USING pcre_dfa_exec()</a><br>
<P>
The <b>pcre_dfa_exec()</b> function moves along the subject string character by
character, without backtracking, searching for all possible matches
simultaneously. If the end of the subject is reached before the end of the
pattern, there is the possibility of a partial match, again provided that at
least one character has matched.
</P>
<P>
When PCRE_PARTIAL_SOFT is set, PCRE_ERROR_PARTIAL is returned only if there
have been no complete matches. Otherwise, the complete matches are returned.
However, if PCRE_PARTIAL_HARD is set, a partial match takes precedence over any
complete matches. The portion of the string that was inspected when the longest
partial match was found is set as the first matching string, provided there are
at least two slots in the offsets vector.
</P>
<P>
Because <b>pcre_dfa_exec()</b> always searches for all possible matches, and
there is no difference between greedy and ungreedy repetition, its behaviour is
different from <b>pcre_exec()</b> when PCRE_PARTIAL_HARD is set. Consider the
string "dog" matched against the ungreedy pattern shown above:
<pre>
  /dog(sbody)??/
</pre>
Whereas <b>pcre_exec()</b> stops as soon as it finds the complete match for
"dog", <b>pcre_dfa_exec()</b> also finds the partial match for "dogsbody", and
so returns that when PCRE_PARTIAL_HARD is set.
</P>
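<P>
The behaviour of an "all matches at once" matcher can be modelled in the same
simplified way, again treating the pattern as a set of literal alternatives.
This Python sketch is an illustration under that assumption, not the algorithm
used by <b>pcre_dfa_exec()</b>; the name <b>dfa_model</b> and its return values
are hypothetical.
</P>

```python
def dfa_model(alternatives, subject, hard=False):
    """Toy model: consider every literal alternative simultaneously.

    Returns ('match', [longest, ..., shortest]), ('partial', text)
    or ('nomatch', None). Order of alternatives is irrelevant here.
    """
    for start in range(len(subject)):
        rest = subject[start:]
        # All alternatives that match completely at this position,
        # longest first.
        complete = sorted((a for a in alternatives if rest.startswith(a)),
                          key=len, reverse=True)
        # Is there an alternative that needs more text than is available?
        partial = any(a.startswith(rest) and len(a) > len(rest)
                      for a in alternatives)
        if partial and (hard or not complete):
            return ("partial", rest)
        if complete:
            return ("match", complete)
    return ("nomatch", None)
```

<P>
Because order plays no part, <b>dfa_model(["dog", "dogsbody"], "dog",
hard=True)</b> reports a partial match even though the shorter alternative is
listed first, which is the contrast with the backtracking matcher drawn in the
text above.
</P>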
<br><a name="SEC4" href="#TOC1">PARTIAL MATCHING AND WORD BOUNDARIES</a><br>
<P>
If a pattern ends with one of the sequences \b or \B, which test for word
boundaries, partial matching with PCRE_PARTIAL_SOFT can give counter-intuitive
results. Consider this pattern:
<pre>
  /\bcat\b/
</pre>
This matches "cat", provided there is a word boundary at either end. If the
subject string is "the cat", the comparison of the final "t" with a following
character cannot take place, so a partial match is found. However,
<b>pcre_exec()</b> carries on with normal matching, which matches \b at the end
of the subject when the last character is a letter, thus finding a complete
match. The result, therefore, is <i>not</i> PCRE_ERROR_PARTIAL. The same thing
happens with <b>pcre_dfa_exec()</b>, because it also finds the complete match.
</P>
<P>
Using PCRE_PARTIAL_HARD in this case does yield PCRE_ERROR_PARTIAL, because
then the partial match takes precedence.
</P>
<br><a name="SEC5" href="#TOC1">FORMERLY RESTRICTED PATTERNS</a><br>
<P>
For releases of PCRE prior to 8.00, because of the way certain internal
optimizations were implemented in the <b>pcre_exec()</b> function, the
PCRE_PARTIAL option (predecessor of PCRE_PARTIAL_SOFT) could not be used with
all patterns. From release 8.00 onwards, the restrictions no longer apply, and
partial matching with <b>pcre_exec()</b> can be requested for any pattern.
</P>
<P>
Items that were formerly restricted were repeated single characters and
repeated metasequences. If PCRE_PARTIAL was set for a pattern that did not
conform to the restrictions, <b>pcre_exec()</b> returned the error code
PCRE_ERROR_BADPARTIAL (-13). This error code is no longer in use. The
PCRE_INFO_OKPARTIAL call to <b>pcre_fullinfo()</b> to find out if a compiled
pattern can be used for partial matching now always returns 1.
</P>
<br><a name="SEC6" href="#TOC1">EXAMPLE OF PARTIAL MATCHING USING PCRETEST</a><br>
<P>
If the escape sequence \P is present in a <b>pcretest</b> data line, the
PCRE_PARTIAL_SOFT option is used for the match. Here is a run of <b>pcretest</b>
that uses the date example quoted above:
<pre>
    re&#62; /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
  data&#62; 25jun04\P
   0: 25jun04
   1: jun
  data&#62; 25dec3\P
  Partial match: 25dec3
  data&#62; 3ju\P
  Partial match: 3ju
  data&#62; 3juj\P
  No match
  data&#62; j\P
  No match
</pre>
The first data string is matched completely, so <b>pcretest</b> shows the
matched substrings. The remaining four strings do not match the complete
pattern, but the first two are partial matches. Similar output is obtained
when <b>pcre_dfa_exec()</b> is used.
</P>
<P>
If the escape sequence \P is present more than once in a <b>pcretest</b> data
line, the PCRE_PARTIAL_HARD option is set for the match.
</P>
<br><a name="SEC7" href="#TOC1">MULTI-SEGMENT MATCHING WITH pcre_dfa_exec()</a><br>
<P>
When a partial match has been found using <b>pcre_dfa_exec()</b>, it is possible
to continue the match by providing additional subject data and calling
<b>pcre_dfa_exec()</b> again with the same compiled regular expression, this
time setting the PCRE_DFA_RESTART option. You must pass the same working
space as before, because this is where details of the previous partial match
are stored. Here is an example using <b>pcretest</b>, using the \R escape
sequence to set the PCRE_DFA_RESTART option (\D specifies the use of
<b>pcre_dfa_exec()</b>):
<pre>
    re&#62; /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
  data&#62; 23ja\P\D
  Partial match: 23ja
  data&#62; n05\R\D
   0: n05
</pre>
The first call has "23ja" as the subject, and requests partial matching; the
second call has "n05" as the subject for the continued (restarted) match.
Notice that when the match is complete, only the last part is shown; PCRE does
not retain the previously partially-matched string. It is up to the calling
program to do that if it needs to.
</P>
<P>
You can set the PCRE_PARTIAL_SOFT or PCRE_PARTIAL_HARD options with
PCRE_DFA_RESTART to continue partial matching over multiple segments. This
facility can be used to pass very long subject strings to
<b>pcre_dfa_exec()</b>.
</P>
<br><a name="SEC8" href="#TOC1">MULTI-SEGMENT MATCHING WITH pcre_exec()</a><br>
<P>
From release 8.00, <b>pcre_exec()</b> can also be used to do multi-segment
matching. Unlike <b>pcre_dfa_exec()</b>, it is not possible to restart the
previous match with a new segment of data. Instead, new data must be added to
the previous subject string, and the entire match re-run, starting from the
point where the partial match occurred. Earlier data can be discarded.
Consider an unanchored pattern that matches dates:
<pre>
    re&#62; /\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d/
  data&#62; The date is 23ja\P
  Partial match: 23ja
</pre>
At this stage, an application could discard the text preceding "23ja", add on
text from the next segment, and call <b>pcre_exec()</b> again. Unlike
<b>pcre_dfa_exec()</b>, the entire matching string must always be available, and
the complete matching process occurs for each call, so more memory and more
processing time are needed.
</P>
<P>
<b>Note:</b> If the pattern contains lookbehind assertions, or \K, or starts
with \b or \B, the string that is returned for a partial match will include
characters that precede the partially matched string itself, because these must
be retained when adding on more characters for a subsequent matching attempt.
</P>
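<P>
The "discard, append, re-run" procedure described in this section can be
sketched as a buffering loop. The Python code below is an illustration only:
<b>find_literal</b> is a hypothetical stand-in for a partial-aware matcher that
handles a single literal string, and <b>match_segments</b> is an assumed name
for the driver loop. Neither is part of PCRE.
</P>

```python
def find_literal(needle, subject):
    """Toy partial-aware search for one literal string.

    Returns (status, start, end) with status 'match', 'partial'
    or 'nomatch'. Stand-in for a real matcher; assumption only.
    """
    for start in range(len(subject)):
        rest = subject[start:]
        if rest.startswith(needle):
            return ("match", start, start + len(needle))
        if needle.startswith(rest):
            # The subject ended part-way through the needle.
            return ("partial", start, len(subject))
    return ("nomatch", -1, -1)


def match_segments(needle, segments):
    """Feed segments one at a time, keeping only the text still needed.

    Returns (start, end) offsets in the concatenated input, or None.
    """
    buffer = ""
    discarded = 0  # number of characters dropped from the front so far
    for segment in segments:
        buffer += segment
        status, start, end = find_literal(needle, buffer)
        if status == "match":
            return (discarded + start, discarded + end)
        if status == "partial":
            # Keep the partially matched tail, drop everything before it.
            discarded += start
            buffer = buffer[start:]
        else:
            # Nothing worth keeping: drop the whole buffer.
            discarded += len(buffer)
            buffer = ""
    return None
```

<P>
For a real pattern with lookbehind assertions the retained text would have to
start at the first returned offset, as the note above explains, rather than at
the start of the partially matched string itself.
</P>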
<br><a name="SEC9" href="#TOC1">ISSUES WITH MULTI-SEGMENT MATCHING</a><br>
<P>
Certain types of pattern may give problems with multi-segment matching,
whichever matching function is used.
</P>
<P>
1. If the pattern contains tests for the beginning or end of a line, you need
to pass the PCRE_NOTBOL or PCRE_NOTEOL options, as appropriate, when the
subject string for any call does not contain the beginning or end of a line.
</P>
<P>
2. Lookbehind assertions at the start of a pattern are catered for in the
offsets that are returned for a partial match. However, in theory, a lookbehind
assertion later in the pattern could require even earlier characters to be
inspected, and it might not have been reached when a partial match occurs. This
is probably an extremely unlikely case; you could guard against it to a certain
extent by always including extra characters at the start.
</P>
<P>
3. Matching a subject string that is split into multiple segments may not
always produce exactly the same result as matching over one single long string,
especially when PCRE_PARTIAL_SOFT is used. The section "Partial Matching and
Word Boundaries" above describes an issue that arises if the pattern ends with
\b or \B. Another kind of difference may occur when there are multiple
matching possibilities, because a partial match result is given only when there
are no completed matches. This means that as soon as the shortest match has
been found, continuation to a new subject segment is no longer possible.
Consider again this <b>pcretest</b> example:
<pre>
    re&#62; /dog(sbody)?/
  data&#62; dogsb\P
   0: dog
  data&#62; do\P\D
  Partial match: do
  data&#62; gsb\R\P\D
   0: g
  data&#62; dogsbody\D
   0: dogsbody
   1: dog
</pre>
The first data line passes the string "dogsb" to <b>pcre_exec()</b>, setting the
PCRE_PARTIAL_SOFT option. Although the string is a partial match for
"dogsbody", the result is not PCRE_ERROR_PARTIAL, because the shorter string
"dog" is a complete match. Similarly, when the subject is presented to
<b>pcre_dfa_exec()</b> in several parts ("do" and "gsb" being the first two) the
match stops when "dog" has been found, and it is not possible to continue. On
the other hand, if "dogsbody" is presented as a single string,
<b>pcre_dfa_exec()</b> finds both matches.
</P>
<P>
Because of these problems, it is probably best to use PCRE_PARTIAL_HARD when
matching multi-segment data. The example above then behaves differently:
<pre>
    re&#62; /dog(sbody)?/
  data&#62; dogsb\P\P
  Partial match: dogsb
  data&#62; do\P\D
  Partial match: do
  data&#62; gsb\R\P\P\D
  Partial match: gsb
</pre>
</P>
<P>
4. Patterns that contain alternatives at the top level which do not all
start with the same pattern item may not work as expected when
<b>pcre_dfa_exec()</b> is used. For example, consider this pattern:
<pre>
  1234|3789
</pre>
If the first part of the subject is "ABC123", a partial match of the first
alternative is found at offset 3. There is no partial match for the second
alternative, because such a match does not start at the same point in the
subject string. Attempting to continue with the string "7890" does not yield a
match because only those alternatives that match at one point in the subject
are remembered. The problem arises because the start of the second alternative
matches within the first alternative. There is no problem with anchored
patterns or patterns such as:
<pre>
  1234|ABCD
</pre>
where no string can be a partial match for both alternatives. This is not a
problem if <b>pcre_exec()</b> is used, because the entire match has to be rerun
each time:
<pre>
    re&#62; /1234|3789/
  data&#62; ABC123\P
  Partial match: 123
  data&#62; 1237890
   0: 3789
</pre>
</P>
<br><a name="SEC10" href="#TOC1">AUTHOR</a><br>
<P>
Philip Hazel
<br>
University Computing Service
<br>
Cambridge CB2 3QH, England.
<br>
</P>
<br><a name="SEC11" href="#TOC1">REVISION</a><br>
<P>
Last updated: 29 September 2009
<br>
Copyright &copy; 1997-2009 University of Cambridge.
<br>
<p>
Return to the <a href="index.html">PCRE index page</a>.
</p>
