Contents of /code/trunk/doc/pcrepartial.3

Revision 858
Sun Jan 8 17:55:38 2012 UTC by ph10
File size: 18425 byte(s)
Documentation updates.

1 nigel 79 .TH PCREPARTIAL 3
2 nigel 75 .SH NAME
3     PCRE - Perl-compatible regular expressions
4     .SH "PARTIAL MATCHING IN PCRE"
5     .rs
6     .sp
7 ph10 858 In normal use of PCRE, if the subject string that is passed to a matching
8     function matches as far as it goes, but is too short to match the entire
9     pattern, PCRE_ERROR_NOMATCH is returned. There are circumstances where it might
10     be helpful to distinguish this case from other cases in which there is no
11     match.
12 nigel 75 .P
13     Consider, for example, an application where a human is required to type in data
14     for a field with specific formatting requirements. An example might be a date
15     in the form \fIddmmmyy\fP, defined by this pattern:
16     .sp
17     ^\ed?\ed(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\ed\ed$
18     .sp
19     If the application sees the user's keystrokes one by one, and can check that
20     what has been typed so far is potentially valid, it is able to raise an error
21 ph10 428 as soon as a mistake is made, by beeping and not reflecting the character that
22     has been typed, for example. This immediate feedback is likely to be a better
23 nigel 75 user interface than a check that is delayed until the entire string has been
24 ph10 553 entered. Partial matching can also be useful when the subject string is very
25     long and is not all available at once.
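.P
The three outcomes that matter to such an application (complete match, partial
match, no match) can be made concrete with a small sketch. It does not call
PCRE. It is a hand-written stand-in for the date pattern above, and the names
\fBclassify_date_prefix()\fP and \fBenum outcome\fP are invented for this
illustration. Like PCRE, it reports no partial match for an empty input,
because no character has been inspected.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Ordered so that a "better" outcome compares greater. */
enum outcome { NO_MATCH, PARTIAL, COMPLETE };

static const char *const months[] = {
    "jan", "feb", "mar", "apr", "may", "jun",
    "jul", "aug", "sep", "oct", "nov", "dec"
};

/* Consume `count` digits starting at *pos. COMPLETE if all were present,
   PARTIAL if the input ran out first, NO_MATCH on a non-digit. */
static enum outcome eat_digits(const char *s, size_t n, size_t *pos, size_t count)
{
    for (size_t k = 0; k < count; k++) {
        if (*pos == n) return PARTIAL;
        if (!isdigit((unsigned char)s[*pos])) return NO_MATCH;
        (*pos)++;
    }
    return COMPLETE;
}

/* Consume a month name, or accept a proper prefix of one at end of input. */
static enum outcome eat_month(const char *s, size_t n, size_t *pos)
{
    size_t have = n - *pos;
    size_t cmp = have < 3 ? have : 3;
    for (size_t m = 0; m < sizeof months / sizeof months[0]; m++) {
        if (strncmp(s + *pos, months[m], cmp) == 0) {
            *pos += cmp;
            return cmp == 3 ? COMPLETE : PARTIAL;
        }
    }
    return NO_MATCH;
}

/* Try the pattern with a fixed number of day digits (1 or 2). */
static enum outcome try_day_width(const char *s, size_t n, size_t width)
{
    size_t pos = 0;
    enum outcome r;
    if ((r = eat_digits(s, n, &pos, width)) != COMPLETE) return r;
    if ((r = eat_month(s, n, &pos)) != COMPLETE) return r;
    if ((r = eat_digits(s, n, &pos, 2)) != COMPLETE) return r;
    return pos == n ? COMPLETE : NO_MATCH;   /* models the final $ (newline case ignored) */
}

/* Classify what has been typed so far, preferring a complete match. */
enum outcome classify_date_prefix(const char *s)
{
    size_t n = strlen(s);
    if (n == 0) return NO_MATCH;             /* nothing inspected yet */
    enum outcome two = try_day_width(s, n, 2);
    enum outcome one = try_day_width(s, n, 1);
    return two > one ? two : one;
}
```

An application could call this after every keystroke and reject the character
whenever the result is NO_MATCH. With PCRE itself, the same decision would be
driven by the return value of the matching function.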
26 nigel 75 .P
27 ph10 428 PCRE supports partial matching by means of the PCRE_PARTIAL_SOFT and
28 ph10 858 PCRE_PARTIAL_HARD options, which can be set when calling any of the matching
29     functions. For backwards compatibility, PCRE_PARTIAL is a synonym for
30     PCRE_PARTIAL_SOFT. The essential difference between the two options is whether
31     or not a partial match is preferred to an alternative complete match, though
32     the details differ between the two types of matching function. If both options
33 ph10 428 are set, PCRE_PARTIAL_HARD takes precedence.
34 nigel 75 .P
35 ph10 858 Setting a partial matching option disables the use of any just-in-time code
36     that was set up by studying the compiled pattern with the
37 ph10 678 PCRE_STUDY_JIT_COMPILE option. It also disables two of PCRE's standard
38 ph10 858 optimizations. PCRE remembers the last literal data unit in a pattern, and
39     abandons matching immediately if it is not present in the subject string. This
40 ph10 678 optimization cannot be used for a subject string that might match only
41     partially. If the pattern was studied, PCRE knows the minimum length of a
42     matching string, and does not bother to run the matching function on shorter
43     strings. This optimization is also disabled for partial matching.
44 ph10 428 .
45     .
46 ph10 858 .SH "PARTIAL MATCHING USING pcre_exec() OR pcre16_exec()"
47 ph10 428 .rs
48 ph10 426 .sp
49 ph10 858 A partial match occurs during a call to \fBpcre_exec()\fP or
50     \fBpcre16_exec()\fP when the end of the subject string is reached successfully,
51     but matching cannot continue because more characters are needed. However, at
52     least one character in the subject must have been inspected. This character
53     need not form part of the final matched string; lookbehind assertions and the
54     \eK escape sequence provide ways of inspecting characters before the start of a
55     matched substring. The requirement for inspecting at least one character exists
56     because an empty string can always be matched; without such a restriction there
57     would always be a partial match of an empty string at the end of the subject.
58 ph10 428 .P
59 ph10 858 If there are at least two slots in the offsets vector when a partial match is
60     returned, the first slot is set to the offset of the earliest character that
61     was inspected. For convenience, the second offset points to the end of the
62     subject so that a substring can easily be identified.
63 ph10 435 .P
64     For the majority of patterns, the first offset identifies the start of the
65     partially matched string. However, for patterns that contain lookbehind
66     assertions, or \eK, or begin with \eb or \eB, earlier characters have been
67     inspected while carrying out the match. For example:
68 ph10 428 .sp
69 ph10 435 /(?<=abc)123/
70     .sp
71     This pattern matches "123", but only if it is preceded by "abc". If the subject
72     string is "xyzabc12", the offsets after a partial match are for the substring
73     "abc12", because all these characters are needed if another match is tried
74 ph10 553 with extra characters added to the subject.
75 ph10 435 .P
76 ph10 553 What happens when a partial match is identified depends on which of the two
77 ph10 579 partial matching options is set.
78 ph10 553 .
79     .
80 ph10 858 .SS "PCRE_PARTIAL_SOFT WITH pcre_exec() OR pcre16_exec()"
81 ph10 553 .rs
82     .sp
83 ph10 858 If PCRE_PARTIAL_SOFT is set when \fBpcre_exec()\fP or \fBpcre16_exec()\fP
84     identifies a partial match, the partial match is remembered, but matching
85     continues as normal, and other alternatives in the pattern are tried. If no
86     complete match can be found, PCRE_ERROR_PARTIAL is returned instead of
87     PCRE_ERROR_NOMATCH.
88 ph10 553 .P
89 ph10 579 This option is "soft" because it prefers a complete match over a partial match.
90     All the various matching items in a pattern behave as if the subject string is
91 ph10 553 potentially complete. For example, \ez, \eZ, and $ match at the end of the
92 ph10 579 subject, as normal, and for \eb and \eB the end of the subject is treated as a
93 ph10 553 non-alphanumeric.
94     .P
95 ph10 435 If there is more than one partial match, the first one that was found provides
96     the data that is returned. Consider this pattern:
97     .sp
98 ph10 426 /123\ew+X|dogY/
99     .sp
100     If this is matched against the subject string "abc123dog", both
101 ph10 435 alternatives fail to match, but the end of the subject is reached during
102 ph10 553 matching, so PCRE_ERROR_PARTIAL is returned. The offsets are set to 3 and 9,
103     identifying "123dog" as the first partial match that was found. (In this
104     example, there are two partial matches, because "dog" on its own partially
105     matches the second alternative.)
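.P
Once the two offsets are available, identifying the partially matched text is
a plain substring copy. The following is a minimal sketch, assuming the caller
already holds the subject and the first two slots of the offsets vector after
PCRE_ERROR_PARTIAL was returned. The helper name \fBcopy_partial()\fP is
invented for this illustration and is not part of the PCRE API.

```c
#include <stddef.h>
#include <string.h>

/* Copy subject[ovector[0] .. ovector[1]) into `out` as a NUL-terminated
   string. Returns the number of characters copied, or -1 if the offsets are
   inconsistent or `out` is too small. */
int copy_partial(const char *subject, const int *ovector,
                 char *out, size_t outsize)
{
    if (ovector[0] < 0 || ovector[1] < ovector[0]) return -1;

    size_t len = (size_t)(ovector[1] - ovector[0]);
    if (len + 1 > outsize) return -1;        /* need room for the terminator */

    memcpy(out, subject + ovector[0], len);
    out[len] = '\0';
    return (int)len;
}
```

For the subject "abc123dog" with offsets 3 and 9, this yields "123dog", the
first partial match described above.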
106     .
107     .
108 ph10 858 .SS "PCRE_PARTIAL_HARD WITH pcre_exec() OR pcre16_exec()"
109 ph10 553 .rs
110     .sp
111 ph10 858 If PCRE_PARTIAL_HARD is set for \fBpcre_exec()\fP or \fBpcre16_exec()\fP,
112     PCRE_ERROR_PARTIAL is returned as soon as a partial match is found, without
113     continuing to search for possible complete matches. This option is "hard"
114     because it prefers an earlier partial match over a later complete match. For
115     this reason, the assumption is made that the end of the supplied subject string
116     may not be the true end of the available data, and so, if \ez, \eZ, \eb, \eB,
117     or $ are encountered at the end of the subject, the result is
118     PCRE_ERROR_PARTIAL.
119 ph10 569 .P
120 ph10 858 Setting PCRE_PARTIAL_HARD also affects the way UTF-8 and UTF-16
121     subject strings are checked for validity. Normally, an invalid sequence
122     causes the error PCRE_ERROR_BADUTF8 or PCRE_ERROR_BADUTF16. However, in the
123     special case of a truncated character at the end of the subject,
124     PCRE_ERROR_SHORTUTF8 or PCRE_ERROR_SHORTUTF16 is returned when
125 ph10 569 PCRE_PARTIAL_HARD is set.
126 ph10 553 .
127     .
128     .SS "Comparing hard and soft partial matching"
129     .rs
130 ph10 428 .sp
131 ph10 553 The difference between the two partial matching options can be illustrated by a
132     pattern such as:
133     .sp
134 ph10 428 /dog(sbody)?/
135     .sp
136 ph10 435 This matches either "dog" or "dogsbody", greedily (that is, it prefers the
137 ph10 428 longer string if possible). If it is matched against the string "dog" with
138 ph10 435 PCRE_PARTIAL_SOFT, it yields a complete match for "dog". However, if
139     PCRE_PARTIAL_HARD is set, the result is PCRE_ERROR_PARTIAL. On the other hand,
140 ph10 428 if the pattern is made ungreedy the result is different:
141     .sp
142     /dog(sbody)??/
143     .sp
144 ph10 858 In this case the result is always a complete match because that is found first,
145     and matching never continues after finding a complete match. It might be easier
146     to follow this explanation by thinking of the two patterns like this:
147 ph10 428 .sp
148     /dog(sbody)?/ is the same as /dogsbody|dog/
149     /dog(sbody)??/ is the same as /dog|dogsbody/
150     .sp
151 ph10 858 The second pattern will never match "dogsbody", because it will always find the
152     shorter match first.
153 ph10 428 .
154     .
155 ph10 858 .SH "PARTIAL MATCHING USING pcre_dfa_exec() OR pcre16_dfa_exec()"
156 ph10 428 .rs
157     .sp
158 ph10 858 The DFA functions move along the subject string character by character, without
159     backtracking, searching for all possible matches simultaneously. If the end of
160     the subject is reached before the end of the pattern, there is the possibility
161     of a partial match, again provided that at least one character has been
162     inspected.
163 nigel 77 .P
164 ph10 428 When PCRE_PARTIAL_SOFT is set, PCRE_ERROR_PARTIAL is returned only if there
165     have been no complete matches. Otherwise, the complete matches are returned.
166     However, if PCRE_PARTIAL_HARD is set, a partial match takes precedence over any
167 ph10 435 complete matches. The portion of the string that was inspected when the longest
168     partial match was found is set as the first matching string, provided there are
169     at least two slots in the offsets vector.
170 ph10 428 .P
171 ph10 858 Because the DFA functions always search for all possible matches, and there is
172     no difference between greedy and ungreedy repetition, their behaviour is
173     different from the standard functions when PCRE_PARTIAL_HARD is set. Consider
174     the string "dog" matched against the ungreedy pattern shown above:
175 ph10 428 .sp
176     /dog(sbody)??/
177     .sp
178 ph10 858 Whereas the standard functions stop as soon as they find the complete match for
179     "dog", the DFA functions also find the partial match for "dogsbody", and so
180     return that when PCRE_PARTIAL_HARD is set.
181 nigel 75 .
182     .
183     .SH "PARTIAL MATCHING AND WORD BOUNDARIES"
184 nigel 75 .rs
185     .sp
186 ph10 468 If a pattern ends with one of the sequences \eb or \eB, which test for word
187 ph10 435 boundaries, partial matching with PCRE_PARTIAL_SOFT can give counter-intuitive
188 ph10 428 results. Consider this pattern:
189     .sp
190     /\ebcat\eb/
191     .sp
192     This matches "cat", provided there is a word boundary at either end. If the
193     subject string is "the cat", the comparison of the final "t" with a following
194 ph10 858 character cannot take place, so a partial match is found. However, normal
195     matching carries on, and \eb matches at the end of the subject when the last
196     character is a letter, so a complete match is found. The result, therefore, is
197     \fInot\fP PCRE_ERROR_PARTIAL. Using PCRE_PARTIAL_HARD in this case does yield
198     PCRE_ERROR_PARTIAL, because then the partial match takes precedence.
199 ph10 428 .
200     .
201     .SH "FORMERLY RESTRICTED PATTERNS"
202     .rs
203     .sp
204 ph10 426 For releases of PCRE prior to 8.00, because of the way certain internal
205     optimizations were implemented in the \fBpcre_exec()\fP function, the
206 ph10 428 PCRE_PARTIAL option (predecessor of PCRE_PARTIAL_SOFT) could not be used with
207     all patterns. From release 8.00 onwards, the restrictions no longer apply, and
208 ph10 858 partial matching can be requested for any pattern.
209 nigel 75 .P
210 ph10 426 Items that were formerly restricted were repeated single characters and
211     repeated metasequences. If PCRE_PARTIAL was set for a pattern that did not
212     conform to the restrictions, \fBpcre_exec()\fP returned the error code
213     PCRE_ERROR_BADPARTIAL (-13). This error code is no longer in use. The
214     PCRE_INFO_OKPARTIAL call to \fBpcre_fullinfo()\fP to find out if a compiled
215     pattern can be used for partial matching now always returns 1.
216 nigel 75 .
217     .
218     .SH "PARTIAL MATCHING USING pcretest"
219     .rs
220     .sp
221     If the escape sequence \eP is present in a \fBpcretest\fP data line, the
222 ph10 428 PCRE_PARTIAL_SOFT option is used for the match. Here is a run of \fBpcretest\fP
223     that uses the date example quoted above:
224 nigel 75 .sp
225     re> /^\ed?\ed(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\ed\ed$/
226 nigel 87 data> 25jun04\eP
227 nigel 75 0: 25jun04
228     1: jun
229 nigel 87 data> 25dec3\eP
230 ph10 426 Partial match: 25dec3
231 nigel 87 data> 3ju\eP
232 ph10 426 Partial match: 3ju
233 nigel 87 data> 3juj\eP
234 nigel 75 No match
235 nigel 87 data> j\eP
236 nigel 75 No match
237     .sp
238     The first data string is matched completely, so \fBpcretest\fP shows the
239     matched substrings. The remaining four strings do not match the complete
240 ph10 426 pattern, but the first two are partial matches. Similar output is obtained
241 ph10 858 if DFA matching is used.
242 ph10 428 .P
243     If the escape sequence \eP is present more than once in a \fBpcretest\fP data
244     line, the PCRE_PARTIAL_HARD option is set for the match.
245 ph10 426 .
246 ph10 435 .
247 ph10 858 .SH "MULTI-SEGMENT MATCHING WITH pcre_dfa_exec() OR pcre16_dfa_exec()"
248 nigel 77 .rs
249     .sp
250 ph10 858 When a partial match has been found using a DFA matching function, it is
251     possible to continue the match by providing additional subject data and calling
252     the function again with the same compiled regular expression, this time setting
253     the PCRE_DFA_RESTART option. You must pass the same working space as before,
254     because this is where details of the previous partial match are stored. Here is
255     an example using \fBpcretest\fP, using the \eR escape sequence to set the
256     PCRE_DFA_RESTART option (\eD specifies the use of the DFA matching function):
257 nigel 77 .sp
258 ph10 155 re> /^\ed?\ed(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\ed\ed$/
259 nigel 77 data> 23ja\eP\eD
260     Partial match: 23ja
261     data> n05\eR\eD
262     0: n05
263     .sp
264     The first call has "23ja" as the subject, and requests partial matching; the
265     second call has "n05" as the subject for the continued (restarted) match.
266     Notice that when the match is complete, only the last part is shown; PCRE does
267     not retain the previously partially-matched string. It is up to the calling
268     program to do that if it needs to.
269 nigel 75 .P
270 ph10 428 You can set the PCRE_PARTIAL_SOFT or PCRE_PARTIAL_HARD options with
271     PCRE_DFA_RESTART to continue partial matching over multiple segments. This
272 ph10 858 facility can be used to pass very long subject strings to the DFA matching
273     functions.
274 ph10 426 .
275     .
276 ph10 858 .SH "MULTI-SEGMENT MATCHING WITH pcre_exec() OR pcre16_exec()"
277 ph10 426 .rs
278     .sp
279 ph10 858 From release 8.00, the standard matching functions can also be used to do
280     multi-segment matching. Unlike the DFA functions, it is not possible to
281     restart the previous match with a new segment of data. Instead, new data must
282     be added to the previous subject string, and the entire match re-run, starting
283     from the point where the partial match occurred. Earlier data can be discarded.
284     .P
285     It is best to use PCRE_PARTIAL_HARD in this situation, because it does not
286     treat the end of a segment as the end of the subject when matching \ez, \eZ,
287     \eb, \eB, and $. Consider an unanchored pattern that matches dates:
288 ph10 426 .sp
289     re> /\ed?\ed(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\ed\ed/
290 ph10 553 data> The date is 23ja\eP\eP
291 ph10 426 Partial match: 23ja
292     .sp
293 ph10 468 At this stage, an application could discard the text preceding "23ja", add on
294 ph10 858 text from the next segment, and call the matching function again. Unlike the
295     DFA matching functions, the entire matching string must always be available,
296 ph10 435 and the complete matching process occurs for each call, so more memory and
297 ph10 426 more processing time are needed.
298 ph10 435 .P
299     \fBNote:\fP If the pattern contains lookbehind assertions, or \eK, or starts
300 ph10 858 with \eb or \eB, the string that is returned for a partial match includes
301 ph10 435 characters that precede the partially matched string itself, because these must
302     be retained when adding on more characters for a subsequent matching attempt.
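.P
The buffer handling between two calls can be sketched as follows. This is an
illustration under stated assumptions, not part of PCRE: the helper name
\fBcarry_over()\fP and its parameters are invented here, and
\fIpartial_start\fP stands for the first offset returned for the partial
match. Everything before that offset is discarded, the rest is moved to the
front of the buffer, and the next segment is appended.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Keep buf[partial_start .. len), move it to the front of buf, and append
   the next segment. `cap` is the total capacity of buf. Returns the new
   subject length, or SIZE_MAX if the arguments are inconsistent or the
   result would not fit. */
size_t carry_over(char *buf, size_t len, size_t cap, size_t partial_start,
                  const char *segment, size_t seg_len)
{
    if (len > cap || partial_start > len) return SIZE_MAX;

    size_t kept = len - partial_start;       /* characters still needed */
    if (seg_len > cap - kept) return SIZE_MAX;

    memmove(buf, buf + partial_start, kept); /* regions may overlap */
    memcpy(buf + kept, segment, seg_len);
    return kept + seg_len;
}
```

In the example above, the subject "The date is 23ja" has its partial match
starting at offset 12, so after appending a segment such as "n05." the new
subject is "23jan05.", and the matching function is called again on that
string from offset 0.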
303 ph10 426 .
304 ph10 435 .
305     .SH "ISSUES WITH MULTI-SEGMENT MATCHING"
306     .rs
307     .sp
308 ph10 435 Certain types of pattern may give problems with multi-segment matching,
309 ph10 426 whichever matching function is used.
310 nigel 77 .P
311 ph10 553 1. If the pattern contains a test for the beginning of a line, you need to pass
312     the PCRE_NOTBOL option when the subject string for any call does not start at the
313 ph10 579 beginning of a line. There is also a PCRE_NOTEOL option, but in practice when
314     doing multi-segment matching you should be using PCRE_PARTIAL_HARD, which
315 ph10 553 includes the effect of PCRE_NOTEOL.
316 nigel 77 .P
317 ph10 435 2. Lookbehind assertions at the start of a pattern are catered for in the
318     offsets that are returned for a partial match. However, in theory, a lookbehind
319     assertion later in the pattern could require even earlier characters to be
320     inspected, and it might not have been reached when a partial match occurs. This
321     is probably an extremely unlikely case; you could guard against it to a certain
322     extent by always including extra characters at the start.
323 nigel 77 .P
324 ph10 428 3. Matching a subject string that is split into multiple segments may not
325     always produce exactly the same result as matching over one single long string,
326 ph10 435 especially when PCRE_PARTIAL_SOFT is used. The section "Partial Matching and
327     Word Boundaries" above describes an issue that arises if the pattern ends with
328 ph10 428 \eb or \eB. Another kind of difference may occur when there are multiple
329 ph10 553 matching possibilities, because (for PCRE_PARTIAL_SOFT) a partial match result
330     is given only when there are no completed matches. This means that as soon as
331     the shortest match has been found, continuation to a new subject segment is no
332     longer possible. Consider again this \fBpcretest\fP example:
333 nigel 77 .sp
334     re> /dog(sbody)?/
335 ph10 426 data> dogsb\eP
336 ph10 435 0: dog
337 nigel 77 data> do\eP\eD
338     Partial match: do
339     data> gsb\eR\eP\eD
340     0: g
341     data> dogsbody\eD
342     0: dogsbody
343     1: dog
344     .sp
345 ph10 858 The first data line passes the string "dogsb" to a standard matching function,
346     setting the PCRE_PARTIAL_SOFT option. Although the string is a partial match
347     for "dogsbody", the result is not PCRE_ERROR_PARTIAL, because the shorter
348     string "dog" is a complete match. Similarly, when the subject is presented to
349     a DFA matching function in several parts ("do" and "gsb" being the first two)
350     the match stops when "dog" has been found, and it is not possible to continue.
351     On the other hand, if "dogsbody" is presented as a single string, a DFA
352     matching function finds both matches.
353 nigel 77 .P
354 ph10 553 Because of these problems, it is best to use PCRE_PARTIAL_HARD when matching
355     multi-segment data. The example above then behaves differently:
356 ph10 428 .sp
357     re> /dog(sbody)?/
358     data> dogsb\eP\eP
359 ph10 435 Partial match: dogsb
360 ph10 428 data> do\eP\eD
361     Partial match: do
362     data> gsb\eR\eP\eP\eD
363 ph10 435 Partial match: gsb
364 ph10 428 .sp
365 ph10 858 4. Patterns that contain alternatives at the top level which do not all start
366     with the same pattern item may not work as expected when PCRE_DFA_RESTART is
367     used. For example, consider this pattern:
368 nigel 87 .sp
369     1234|3789
370     .sp
371     If the first part of the subject is "ABC123", a partial match of the first
372     alternative is found at offset 3. There is no partial match for the second
373     alternative, because such a match does not start at the same point in the
374 ph10 426 subject string. Attempting to continue with the string "7890" does not yield a
375 nigel 87 match because only those alternatives that match at one point in the subject
376     are remembered. The problem arises because the start of the second alternative
377     matches within the first alternative. There is no problem with anchored
378     patterns or patterns such as:
379     .sp
380     1234|ABCD
381     .sp
382 ph10 426 where no string can be a partial match for both alternatives. This is not a
383 ph10 858 problem if a standard matching function is used, because the entire match has
384     to be rerun each time:
385 ph10 426 .sp
386     re> /1234|3789/
387 ph10 553 data> ABC123\eP\eP
388 ph10 426 Partial match: 123
389     data> 1237890
390     0: 3789
391 ph10 435 .sp
392 ph10 553 Of course, instead of using PCRE_DFA_RESTART, the same technique of re-running
393 ph10 858 the entire match can also be used with the DFA matching functions. Another
394 ph10 469 possibility is to work with two buffers. If a partial match at offset \fIn\fP
395     in the first buffer is followed by "no match" when PCRE_DFA_RESTART is used on
396     the second buffer, you can then try a new match starting at offset \fIn+1\fP in
397 ph10 468 the first buffer.
398 nigel 77 .
399     .
400 ph10 99 .SH AUTHOR
401     .rs
402     .sp
403     .nf
404     Philip Hazel
405     University Computing Service
406     Cambridge CB2 3QH, England.
407     .fi
408     .
409     .
410     .SH REVISION
411     .rs
412     .sp
413     .nf
414 ph10 858 Last updated: 08 January 2012
415     Copyright (c) 1997-2012 University of Cambridge.
416 ph10 99 .fi


Name Value
svn:eol-style native
svn:keywords "Author Date Id Revision Url"
