Diff of /code/trunk/doc/pcrepartial.3

revision 425 by ph10, Tue Jun 5 10:40:13 2007 UTC revision 426 by ph10, Wed Aug 26 15:38:32 2009 UTC
# Line 25  entered. Line 25  entered.
25  .P  .P
26  PCRE supports the concept of partial matching by means of the PCRE_PARTIAL  PCRE supports the concept of partial matching by means of the PCRE_PARTIAL
27  option, which can be set when calling \fBpcre_exec()\fP or  option, which can be set when calling \fBpcre_exec()\fP or
28  \fBpcre_dfa_exec()\fP. When this flag is set for \fBpcre_exec()\fP, the return  \fBpcre_dfa_exec()\fP.
29  code PCRE_ERROR_NOMATCH is converted into PCRE_ERROR_PARTIAL if at any time  .P
30  during the matching process the last part of the subject string matched part of  When PCRE_PARTIAL is set for \fBpcre_exec()\fP, the return code
31  the pattern. Unfortunately, for non-anchored matching, it is not possible to  PCRE_ERROR_NOMATCH is converted into PCRE_ERROR_PARTIAL if at any time during
32  obtain the position of the start of the partial match. No captured data is set  the matching process the last part of the subject string matched part of the
33  when PCRE_ERROR_PARTIAL is returned.  pattern. If there are at least two slots in the offsets vector, they are filled
34    in with the offsets of the longest found string that partially matched. No
35    other captured data is set when PCRE_ERROR_PARTIAL is returned. The second
36    offset is always that for the end of the subject. Consider this pattern:
37    .sp
38      /123\ew+X|dogY/
39    .sp
40    If this is matched against the subject string "abc123dog", both
41    alternatives fail to match, but the end of the subject is reached, so
42    PCRE_ERROR_PARTIAL is returned instead of PCRE_ERROR_NOMATCH if the
43    PCRE_PARTIAL option is set. The offsets are set to 3 and 9, identifying
44    "123dog" as the longest partial match that was found. (In this example, there
45    are two partial matches, because "dog" on its own partially matches the second
46    alternative.)
47  .P  .P
48  When PCRE_PARTIAL is set for \fBpcre_dfa_exec()\fP, the return code  When PCRE_PARTIAL is set for \fBpcre_dfa_exec()\fP, the return code
49  PCRE_ERROR_NOMATCH is converted into PCRE_ERROR_PARTIAL if the end of the  PCRE_ERROR_NOMATCH is converted into PCRE_ERROR_PARTIAL if the end of the
50  subject is reached, there have been no complete matches, but there is still at  subject is reached, there have been no complete matches, but there is still at
51  least one matching possibility. The portion of the string that provided the  least one matching possibility. The portion of the string that provided the
52  partial match is set as the first matching string.  longest partial match is set as the first matching string, provided there are
53    at least two slots in the offsets vector.
54  .P  .P
55  Using PCRE_PARTIAL disables one of PCRE's optimizations. PCRE remembers the  Using PCRE_PARTIAL disables one of PCRE's optimizations. PCRE remembers the
56  last literal byte in a pattern, and abandons matching immediately if such a  last literal byte in a pattern, and abandons matching immediately if such a
# Line 44  byte is not present in the subject strin Line 58  byte is not present in the subject strin
58  for a subject string that might match only partially.  for a subject string that might match only partially.
59  .  .
60  .  .
61  .SH "RESTRICTED PATTERNS FOR PCRE_PARTIAL"  .SH "FORMERLY RESTRICTED PATTERNS FOR PCRE_PARTIAL"
62  .rs  .rs
63  .sp  .sp
64  Because of the way certain internal optimizations are implemented in the  For releases of PCRE prior to 8.00, because of the way certain internal
65  \fBpcre_exec()\fP function, the PCRE_PARTIAL option cannot be used with all  optimizations were implemented in the \fBpcre_exec()\fP function, the
66  patterns. These restrictions do not apply when \fBpcre_dfa_exec()\fP is used.  PCRE_PARTIAL option could not be used with all patterns. From release 8.00
67  For \fBpcre_exec()\fP, repeated single characters such as  onwards, the restrictions no longer apply, and partial matching can be
68  .sp  requested for any pattern.
69    a{2,4}  .P
70  .sp  Items that were formerly restricted were repeated single characters and
71  and repeated single metasequences such as  repeated metasequences. If PCRE_PARTIAL was set for a pattern that did not
72  .sp  conform to the restrictions, \fBpcre_exec()\fP returned the error code
73    \ed+  PCRE_ERROR_BADPARTIAL (-13). This error code is no longer in use. The
74  .sp  PCRE_INFO_OKPARTIAL call to \fBpcre_fullinfo()\fP to find out if a compiled
75  are not permitted if the maximum number of occurrences is greater than one.  pattern can be used for partial matching now always returns 1.
 Optional items such as \ed? (where the maximum is one) are permitted.  
 Quantifiers with any values are permitted after parentheses, so the invalid  
 examples above can be coded thus:  
 .sp  
   (a){2,4}  
   (\ed)+  
 .sp  
 These constructions run more slowly, but for the kinds of application that are  
 envisaged for this facility, this is not felt to be a major restriction.  
 .P  
 If PCRE_PARTIAL is set for a pattern that does not conform to the restrictions,  
 \fBpcre_exec()\fP returns the error code PCRE_ERROR_BADPARTIAL (-13).  
 You can use the PCRE_INFO_OKPARTIAL call to \fBpcre_fullinfo()\fP to find out  
 if a compiled pattern can be used for partial matching.  
76  .  .
77  .  .
78  .SH "EXAMPLE OF PARTIAL MATCHING USING PCRETEST"  .SH "EXAMPLE OF PARTIAL MATCHING USING PCRETEST"
# Line 87  uses the date example quoted above: Line 87  uses the date example quoted above:
87     0: 25jun04     0: 25jun04
88     1: jun     1: jun
89    data> 25dec3\eP    data> 25dec3\eP
90    Partial match    Partial match: 25dec3
91    data> 3ju\eP    data> 3ju\eP
92    Partial match    Partial match: 3ju
93    data> 3juj\eP    data> 3juj\eP
94    No match    No match
95    data> j\eP    data> j\eP
# Line 97  uses the date example quoted above: Line 97  uses the date example quoted above:
97  .sp  .sp
98  The first data string is matched completely, so \fBpcretest\fP shows the  The first data string is matched completely, so \fBpcretest\fP shows the
99  matched substrings. The remaining four strings do not match the complete  matched substrings. The remaining four strings do not match the complete
100  pattern, but the first two are partial matches. The same test, using  pattern, but the first two are partial matches. Similar output is obtained
101  \fBpcre_dfa_exec()\fP matching (by means of the \eD escape sequence), produces  when \fBpcre_dfa_exec()\fP is used.
102  the following output:  .
103  .sp  .
104      re> /^\ed?\ed(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\ed\ed$/  .SH "ISSUES WITH PARTIAL MATCHING"
105    data> 25jun04\eP\eD  .rs
    0: 25jun04  
   data> 23dec3\eP\eD  
   Partial match: 23dec3  
   data> 3ju\eP\eD  
   Partial match: 3ju  
   data> 3juj\eP\eD  
   No match  
   data> j\eP\eD  
   No match  
106  .sp  .sp
107  Notice that in this case the portion of the string that was matched is made  Certain types of pattern may behave in unintuitive ways when partial matching
108  available.  is requested, whichever matching function is used. For example, matching a
109    pattern that ends with (*FAIL), or any other assertion that causes a match to
110    fail without inspecting any data, yields PCRE_ERROR_PARTIAL rather than
111    PCRE_ERROR_NOMATCH:
112    .sp
113        re> /a+(*FAIL)/
114      data> aaa\eP
115      Partial match: aaa
116    .sp
117    Although (*FAIL) itself could possibly be made a special case, there are other
118    assertions, for example (?!), which behave in the same way, and it is not
119    possible to catch all cases. For consistency, therefore, there are no
120    exceptions to the rule that PCRE_ERROR_PARTIAL is returned instead of
121    PCRE_ERROR_NOMATCH if at any time during the match the end of the subject
122    string was reached.
123  .  .
124  .  .
125  .SH "MULTI-SEGMENT MATCHING WITH pcre_dfa_exec()"  .SH "MULTI-SEGMENT MATCHING WITH pcre_dfa_exec()"
# Line 126  to continue the match by providing addit Line 131  to continue the match by providing addit
131  time setting the PCRE_DFA_RESTART option. You must also pass the same working  time setting the PCRE_DFA_RESTART option. You must also pass the same working
132  space as before, because this is where details of the previous partial match  space as before, because this is where details of the previous partial match
133  are stored. Here is an example using \fBpcretest\fP, using the \eR escape  are stored. Here is an example using \fBpcretest\fP, using the \eR escape
134  sequence to set the PCRE_DFA_RESTART option (\eP and \eD are as above):  sequence to set the PCRE_DFA_RESTART option (\eP sets the PCRE_PARTIAL option,
135    and \eD specifies the use of \fBpcre_dfa_exec()\fP):
136  .sp  .sp
137      re> /^\ed?\ed(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\ed\ed$/      re> /^\ed?\ed(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\ed\ed$/
138    data> 23ja\eP\eD    data> 23ja\eP\eD
# Line 142  program to do that if it needs to. Line 148  program to do that if it needs to.
148  .P  .P
149  You can set PCRE_PARTIAL with PCRE_DFA_RESTART to continue partial matching  You can set PCRE_PARTIAL with PCRE_DFA_RESTART to continue partial matching
150  over multiple segments. This facility can be used to pass very long subject  over multiple segments. This facility can be used to pass very long subject
151  strings to \fBpcre_dfa_exec()\fP. However, some care is needed for certain  strings to \fBpcre_dfa_exec()\fP.
152  types of pattern.  .
153    .
154    .SH "MULTI-SEGMENT MATCHING WITH pcre_exec()"
155    .rs
156    .sp
157    From release 8.00, \fBpcre_exec()\fP can also be used to do multi-segment
158    matching. Unlike \fBpcre_dfa_exec()\fP, it is not possible to restart the
159    previous match with a new segment of data. Instead, new data must be added to
160    the previous subject string, and the entire match re-run, starting from the
161    point where the partial match occurred. Earlier data can be discarded.
162    Consider an unanchored pattern that matches dates:
163    .sp
164        re> /\ed?\ed(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\ed\ed/
165      data> The date is 23ja\eP
166      Partial match: 23ja
167    .sp
168    At this stage, an application could discard the text preceding "23ja", add on
169    text from the next segment, and call \fBpcre_exec()\fP again. Unlike
170    \fBpcre_dfa_exec()\fP, the entire matching string must always be available, and
171    the complete matching process occurs for each call, so more memory and more
172    processing time are needed.
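
The discard-and-append step described above might be sketched in C as follows. This is an illustration only, under stated assumptions: the helper name keep_partial_tail and its buffer-capacity argument are hypothetical, and the partial-match start offset is assumed to be the first value that pcre_exec() placed in the offsets vector when it returned PCRE_ERROR_PARTIAL. The call to pcre_exec() itself is left out so that the fragment stands alone.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of PCRE): drop the text that precedes a
   partial match and append the next segment of the subject.

   buf           subject buffer, holding len valid bytes
   cap           total capacity of buf in bytes
   partial_start offset at which the partial match began (assumed to be
                 ovector[0] after PCRE_ERROR_PARTIAL)
   seg, seg_len  the next segment of data

   Returns the new subject length, or (size_t)-1 if the arguments are
   inconsistent or the result would not fit in cap bytes. */
size_t keep_partial_tail(char *buf, size_t len, size_t cap,
                         size_t partial_start,
                         const char *seg, size_t seg_len)
{
    if (partial_start > len)
        return (size_t)-1;

    size_t kept = len - partial_start;      /* bytes of the partial match */
    if (kept > cap || seg_len > cap - kept)
        return (size_t)-1;

    memmove(buf, buf + partial_start, kept); /* regions may overlap */
    memcpy(buf + kept, seg, seg_len);        /* append the new segment */
    return kept + seg_len;
}
```

With the date example, the subject "The date is 23ja" has its partial match at offset 12, so only "23ja" is retained before the next segment is appended. The whole of the resulting buffer would then be passed to pcre_exec() again.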
173    .
174    .
175    .SH "ISSUES WITH MULTI-SEGMENT MATCHING"
176    .rs
177    .sp
178    Certain types of pattern may give problems with multi-segment matching,
179    whichever matching function is used.
180  .P  .P
181  1. If the pattern contains tests for the beginning or end of a line, you need  1. If the pattern contains tests for the beginning or end of a line, you need
182  to pass the PCRE_NOTBOL or PCRE_NOTEOL options, as appropriate, when the  to pass the PCRE_NOTBOL or PCRE_NOTEOL options, as appropriate, when the
# Line 151  subject string for any call does not con Line 184  subject string for any call does not con
184  .P  .P
185  2. If the pattern contains backward assertions (including \eb or \eB), you need  2. If the pattern contains backward assertions (including \eb or \eB), you need
186  to arrange for some overlap in the subject strings to allow for this. For  to arrange for some overlap in the subject strings to allow for this. For
187  example, you could pass the subject in chunks that are 500 bytes long, but in  example, using \fBpcre_dfa_exec()\fP, you could pass the subject in chunks that
188  a buffer of 700 bytes, with the starting offset set to 200 and the previous 200  are 500 bytes long, but in a buffer of 700 bytes, with the starting offset set
189  bytes at the start of the buffer.  to 200 and the previous 200 bytes at the start of the buffer.
190  .P  .P
191  3. Matching a subject string that is split into multiple segments does not  3. Matching a subject string that is split into multiple segments does not
192  always produce exactly the same result as matching over one single long string.  always produce exactly the same result as matching over one single long string.
193  The difference arises when there are multiple matching possibilities, because a  The difference arises when there are multiple matching possibilities, because a
194  partial match result is given only when there are no completed matches in a  partial match result is given only when there are no completed matches. This
195  call to \fBpcre_dfa_exec()\fP. This means that as soon as the shortest match has  means that as soon as the shortest match has been found, continuation to a new
196  been found, continuation to a new subject segment is no longer possible.  subject segment is no longer possible. Consider this \fBpcretest\fP example:
 Consider this \fBpcretest\fP example:  
197  .sp  .sp
198      re> /dog(sbody)?/      re> /dog(sbody)?/
199      data> dogsb\eP
200       0: dog
201    data> do\eP\eD    data> do\eP\eD
202    Partial match: do    Partial match: do
203    data> gsb\eR\eP\eD    data> gsb\eR\eP\eD
# Line 172  Consider this \fBpcretest\fP example: Line 206  Consider this \fBpcretest\fP example:
206     0: dogsbody     0: dogsbody
207     1: dog     1: dog
208  .sp  .sp
209  The pattern matches the words "dog" or "dogsbody". When the subject is  The pattern matches "dog" or "dogsbody". The first data line passes the string
210  presented in several parts ("do" and "gsb" being the first two) the match stops  "dogsb" to \fBpcre_exec()\fP, setting the PCRE_PARTIAL option. Although the
211  when "dog" has been found, and it is not possible to continue. On the other  string is a partial match for "dogsbody", the result is not PCRE_ERROR_PARTIAL,
212  hand, if "dogsbody" is presented as a single string, both matches are found.  because the shorter string "dog" is a complete match. Similarly, when the
213    subject is presented to \fBpcre_dfa_exec()\fP in several parts ("do" and "gsb"
214    being the first two) the match stops when "dog" has been found, and it is not
215    possible to continue. On the other hand, if "dogsbody" is presented as a single
216    string, \fBpcre_dfa_exec()\fP finds both matches.
217  .P  .P
218  Because of this phenomenon, it does not usually make sense to end a pattern  Because of this phenomenon, it does not usually make sense to end a pattern
219  that is going to be matched in this way with a variable repeat.  that is going to be matched in this way with a variable repeat.
220  .P  .P
221  4. Patterns that contain alternatives at the top level which do not all  4. Patterns that contain alternatives at the top level which do not all
222  start with the same pattern item may not work as expected. For example,  start with the same pattern item may not work as expected when
223  consider this pattern:  \fBpcre_dfa_exec()\fP is used. For example, consider this pattern:
224  .sp  .sp
225    1234|3789    1234|3789
226  .sp  .sp
227  If the first part of the subject is "ABC123", a partial match of the first  If the first part of the subject is "ABC123", a partial match of the first
228  alternative is found at offset 3. There is no partial match for the second  alternative is found at offset 3. There is no partial match for the second
229  alternative, because such a match does not start at the same point in the  alternative, because such a match does not start at the same point in the
230  subject string. Attempting to continue with the string "789" does not yield a  subject string. Attempting to continue with the string "7890" does not yield a
231  match because only those alternatives that match at one point in the subject  match because only those alternatives that match at one point in the subject
232  are remembered. The problem arises because the start of the second alternative  are remembered. The problem arises because the start of the second alternative
233  matches within the first alternative. There is no problem with anchored  matches within the first alternative. There is no problem with anchored
# Line 197  patterns or patterns such as: Line 235  patterns or patterns such as:
235  .sp  .sp
236    1234|ABCD    1234|ABCD
237  .sp  .sp
238  where no string can be a partial match for both alternatives.  where no string can be a partial match for both alternatives. This is not a
239    problem if \fBpcre_exec()\fP is used, because the entire match has to be rerun
240    each time:
241    .sp
242        re> /1234|3789/
243      data> ABC123\eP
244      Partial match: 123
245      data> 1237890
246       0: 3789
247    .sp
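
The overlapping-buffer arrangement described in item 2 above (500-byte chunks in a 700-byte buffer, with the previous 200 bytes kept at the start) might be sketched in C as follows. The struct and function names are hypothetical and are not part of PCRE. The sketch only manages the buffer, and it assumes the window is zero-initialised before the first call.

```c
#include <stddef.h>
#include <string.h>

#define CHUNK   500   /* bytes of new subject data per call */
#define OVERLAP 200   /* bytes of earlier data kept for backward assertions */

/* Hypothetical sliding window over a segmented subject. */
struct window {
    char   buf[OVERLAP + CHUNK];
    size_t len;     /* number of valid bytes in buf */
    size_t start;   /* offset of the first new byte in buf */
};

/* Slide the window forward: keep up to OVERLAP bytes from the end of the
   previous contents at the start of the buffer, then copy in up to CHUNK
   bytes of new data. Returns the number of bytes taken from chunk. */
size_t window_push(struct window *w, const char *chunk, size_t n)
{
    size_t keep = w->len < OVERLAP ? w->len : OVERLAP;

    if (n > CHUNK)
        n = CHUNK;

    memmove(w->buf, w->buf + (w->len - keep), keep); /* retained overlap */
    memcpy(w->buf + keep, chunk, n);                 /* new data */
    w->start = keep;
    w->len   = keep + n;
    return n;
}
```

After each call, w.buf and w.len would be passed as the subject and its length, and w.start as the starting offset, so that backward assertions can inspect the retained bytes. For the first chunk there is no earlier data and the starting offset is 0.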
248  .  .
249  .  .
250  .SH AUTHOR  .SH AUTHOR
# Line 214  Cambridge CB2 3QH, England. Line 261  Cambridge CB2 3QH, England.
261  .rs  .rs
262  .sp  .sp
263  .nf  .nf
264  Last updated: 04 June 2007  Last updated: 26 August 2009
265  Copyright (c) 1997-2007 University of Cambridge.  Copyright (c) 1997-2009 University of Cambridge.
266  .fi  .fi

