Contents of /code/trunk/doc/html/pcrepartial.html

Revision 87
Sat Feb 24 21:41:21 2007 UTC (8 years, 2 months ago) by nigel
File MIME type: text/html
File size: 9278 byte(s)
Load pcre-6.5 into code/trunk.

<html>
<head>
<title>pcrepartial specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcrepartial man page</h1>
<p>
Return to the <a href="index.html">PCRE index page</a>.
</p>
<p>
This page is part of the PCRE HTML documentation. It was generated automatically
from the original man page. If there is any nonsense in it, please consult the
man page, in case the conversion went wrong.
<br>
<ul>
<li><a name="TOC1" href="#SEC1">PARTIAL MATCHING IN PCRE</a>
<li><a name="TOC2" href="#SEC2">RESTRICTED PATTERNS FOR PCRE_PARTIAL</a>
<li><a name="TOC3" href="#SEC3">EXAMPLE OF PARTIAL MATCHING USING PCRETEST</a>
<li><a name="TOC4" href="#SEC4">MULTI-SEGMENT MATCHING WITH pcre_dfa_exec()</a>
</ul>
<br><a name="SEC1" href="#TOC1">PARTIAL MATCHING IN PCRE</a><br>
<P>
In normal use of PCRE, if the subject string that is passed to
<b>pcre_exec()</b> or <b>pcre_dfa_exec()</b> matches as far as it goes, but is
too short to match the entire pattern, PCRE_ERROR_NOMATCH is returned. There
are circumstances where it might be helpful to distinguish this case from other
cases in which there is no match.
</P>
<P>
Consider, for example, an application where a human is required to type in data
for a field with specific formatting requirements. An example might be a date
in the form <i>ddmmmyy</i>, defined by this pattern:
<pre>
^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$
</pre>
If the application sees the user's keystrokes one by one, and can check that
what has been typed so far is potentially valid, it is able to raise an error
as soon as a mistake is made, possibly beeping and not reflecting the
character that has been typed. This immediate feedback is likely to be a better
user interface than a check that is delayed until the entire string has been
entered.
</P>
<P>
PCRE supports the concept of partial matching by means of the PCRE_PARTIAL
option, which can be set when calling <b>pcre_exec()</b> or
<b>pcre_dfa_exec()</b>. When this flag is set for <b>pcre_exec()</b>, the return
code PCRE_ERROR_NOMATCH is converted into PCRE_ERROR_PARTIAL if at any time
during the matching process the last part of the subject string matched part of
the pattern. Unfortunately, for non-anchored matching, it is not possible to
obtain the position of the start of the partial match. No captured data is set
when PCRE_ERROR_PARTIAL is returned.
</P>
<P>
When PCRE_PARTIAL is set for <b>pcre_dfa_exec()</b>, the return code
PCRE_ERROR_NOMATCH is converted into PCRE_ERROR_PARTIAL if the end of the
subject is reached, there have been no complete matches, but there is still at
least one matching possibility. The portion of the string that provided the
partial match is set as the first matching string.
</P>
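<P>
For the keystroke-checking application described earlier, the three outcomes
(complete match, partial match, no match) map naturally onto three actions. The
following is only a sketch of that mapping: the enum and the function
<b>action_for_result()</b> are hypothetical names that are not part of PCRE,
and the two error codes are repeated locally, with the values they have in
<b>pcre.h</b>, so that the fragment compiles on its own.
</P>

```c
/* Values as defined in pcre.h, repeated here only so that this sketch
   compiles by itself. A real program would #include <pcre.h> instead. */
#ifndef PCRE_ERROR_NOMATCH
#define PCRE_ERROR_NOMATCH  (-1)
#endif
#ifndef PCRE_ERROR_PARTIAL
#define PCRE_ERROR_PARTIAL  (-12)
#endif

/* Hypothetical application-level actions; not part of PCRE. */
enum input_action {
  INPUT_COMPLETE,        /* the field is fully valid */
  INPUT_KEEP_TYPING,     /* valid so far, more characters are needed */
  INPUT_REJECT_KEY,      /* the last keystroke cannot lead to a match */
  INPUT_INTERNAL_ERROR   /* any other negative return code */
};

/* Map the return code of a matching call made with PCRE_PARTIAL set
   onto an action for the user interface. */
enum input_action action_for_result(int rc)
{
  if (rc >= 0) return INPUT_COMPLETE;
  if (rc == PCRE_ERROR_PARTIAL) return INPUT_KEEP_TYPING;
  if (rc == PCRE_ERROR_NOMATCH) return INPUT_REJECT_KEY;
  return INPUT_INTERNAL_ERROR;
}
```

<P>
Treating every other negative value as an internal error is a design choice of
this sketch; an application may prefer to report individual error codes.
</P>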
<P>
Using PCRE_PARTIAL disables one of PCRE's optimizations. PCRE remembers the
last literal byte in a pattern, and abandons matching immediately if such a
byte is not present in the subject string. This optimization cannot be used
for a subject string that might match only partially.
</P>
<br><a name="SEC2" href="#TOC1">RESTRICTED PATTERNS FOR PCRE_PARTIAL</a><br>
<P>
Because of the way certain internal optimizations are implemented in the
<b>pcre_exec()</b> function, the PCRE_PARTIAL option cannot be used with all
patterns. These restrictions do not apply when <b>pcre_dfa_exec()</b> is used.
For <b>pcre_exec()</b>, repeated single characters such as
<pre>
a{2,4}
</pre>
and repeated single metasequences such as
<pre>
\d+
</pre>
are not permitted if the maximum number of occurrences is greater than one.
Optional items such as \d? (where the maximum is one) are permitted.
Quantifiers with any values are permitted after parentheses, so the invalid
examples above can be coded thus:
<pre>
(a){2,4}
(\d)+
</pre>
These constructions run more slowly, but for the kinds of application that are
envisaged for this facility, this is not felt to be a major restriction.
</P>
<P>
If PCRE_PARTIAL is set for a pattern that does not conform to the restrictions,
<b>pcre_exec()</b> returns the error code PCRE_ERROR_BADPARTIAL (-13).
</P>
<br><a name="SEC3" href="#TOC1">EXAMPLE OF PARTIAL MATCHING USING PCRETEST</a><br>
<P>
If the escape sequence \P is present in a <b>pcretest</b> data line, the
PCRE_PARTIAL flag is used for the match. Here is a run of <b>pcretest</b> that
uses the date example quoted above:
<pre>
re&#62; /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
data&#62; 25jun04\P
0: 25jun04
1: jun
data&#62; 25dec3\P
Partial match
data&#62; 3ju\P
Partial match
data&#62; 3juj\P
No match
data&#62; j\P
No match
</pre>
The first data string is matched completely, so <b>pcretest</b> shows the
matched substrings. The remaining four strings do not match the complete
pattern, but the first two are partial matches. The same test, using DFA
matching (by means of the \D escape sequence), produces the following output:
<pre>
re&#62; /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
data&#62; 25jun04\P\D
0: 25jun04
data&#62; 23dec3\P\D
Partial match: 23dec3
data&#62; 3ju\P\D
Partial match: 3ju
data&#62; 3juj\P\D
No match
data&#62; j\P\D
No match
</pre>
Notice that in this case the portion of the string that was matched is made
available.
</P>
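<P>
To make the three outcomes concrete without involving the library, the sketch
below classifies a string against the <i>ddmmmyy</i> date pattern by hand. It
is not how PCRE works internally and does not call PCRE at all; it treats a
string as a partial match when it is a proper prefix of some complete match,
which is an assumption that suits this anchored pattern. All names in it are
hypothetical.
</P>

```c
#include <ctype.h>
#include <string.h>

enum match_kind { MATCH_NONE, MATCH_PARTIAL, MATCH_FULL };

static const char *const months[] = {
  "jan", "feb", "mar", "apr", "may", "jun",
  "jul", "aug", "sep", "oct", "nov", "dec"
};

/* Return 1 if the first n bytes of s (n <= 3) begin some month name. */
static int month_prefix(const char *s, size_t n)
{
  size_t i;
  for (i = 0; i < sizeof months / sizeof months[0]; i++)
    if (strncmp(s, months[i], n) == 0) return 1;
  return 0;
}

/* Classify s against one fixed layout: ndigits day digits, a
   three-letter month, then two year digits. */
static enum match_kind classify_layout(const char *s, size_t ndigits)
{
  size_t len = strlen(s), total = ndigits + 5, i, mlen;

  if (len > total) return MATCH_NONE;
  for (i = 0; i < len; i++) {
    if (i >= ndigits && i < ndigits + 3) continue;  /* month, checked below */
    if (!isdigit((unsigned char)s[i])) return MATCH_NONE;
  }
  if (len > ndigits) {
    mlen = len - ndigits;
    if (mlen > 3) mlen = 3;
    if (!month_prefix(s + ndigits, mlen)) return MATCH_NONE;
  }
  return len == total ? MATCH_FULL : MATCH_PARTIAL;
}

/* The day may have one or two digits (\d?\d), so try both layouts and
   keep the better result. */
enum match_kind classify_date(const char *s)
{
  enum match_kind one = classify_layout(s, 1);
  enum match_kind two = classify_layout(s, 2);
  return one > two ? one : two;
}
```

<P>
For the five data strings used in the <b>pcretest</b> runs, this function gives
the same classification as the output shown: one complete match, two partial
matches, and two failures.
</P>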
<br><a name="SEC4" href="#TOC1">MULTI-SEGMENT MATCHING WITH pcre_dfa_exec()</a><br>
<P>
When a partial match has been found using <b>pcre_dfa_exec()</b>, it is possible
to continue the match by providing additional subject data and calling
<b>pcre_dfa_exec()</b> again with the PCRE_DFA_RESTART option and the same
working space (where details of the previous partial match are stored). Here is
an example using <b>pcretest</b>, where the \R escape sequence sets the
PCRE_DFA_RESTART option and the \D escape sequence requests the use of
<b>pcre_dfa_exec()</b>:
<pre>
re&#62; /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
data&#62; 23ja\P\D
Partial match: 23ja
data&#62; n05\R\D
0: n05
</pre>
The first call has "23ja" as the subject, and requests partial matching; the
second call has "n05" as the subject for the continued (restarted) match.
Notice that when the match is complete, only the last part is shown; PCRE does
not retain the previously partially-matched string. It is up to the calling
program to do that if it needs to.
</P>
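<P>
Because PCRE does not keep the earlier segments, a caller that wants the whole
matched text has to accumulate it. One possible way of doing so is sketched
below; the structure <b>seen_text</b> and the function <b>seen_append()</b> are
hypothetical helpers, not part of PCRE, and the sketch assumes that each
segment is appended after the matching call that consumed it.
</P>

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical accumulator for the text matched so far. Start with
   data == NULL and len == 0; release data with free() when done. */
struct seen_text {
  char *data;
  size_t len;
};

/* Append one subject segment and keep the buffer NUL-terminated.
   Returns 0 on success, or -1 if memory could not be obtained (in which
   case the existing contents are left untouched). */
int seen_append(struct seen_text *t, const char *seg, size_t seg_len)
{
  char *p = realloc(t->data, t->len + seg_len + 1);
  if (p == NULL) return -1;
  memcpy(p + t->len, seg, seg_len);
  t->len += seg_len;
  p[t->len] = '\0';
  t->data = p;
  return 0;
}
```

<P>
Appending "23ja" and then "n05" leaves "23jan05" in the buffer, which is the
string that the two calls in the example matched between them.
</P>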
<P>
This facility can be used to pass very long subject strings to
<b>pcre_dfa_exec()</b>. However, some care is needed for certain types of
pattern.
</P>
<P>
1. If the pattern contains tests for the beginning or end of a line, you need
to pass the PCRE_NOTBOL or PCRE_NOTEOL options, as appropriate, when the
subject string for any call does not contain the beginning or end of a line.
</P>
<P>
2. If the pattern contains backward assertions (including \b or \B), you need
to arrange for some overlap in the subject strings to allow for this. For
example, you could pass the subject in chunks that were 500 bytes long, but in
a buffer of 700 bytes, with the starting offset set to 200 and the previous 200
bytes at the start of the buffer.
</P>
<P>
3. Matching a subject string that is split into multiple segments does not
always produce exactly the same result as matching over one single long string.
The difference arises when there are multiple matching possibilities, because a
partial match result is given only when there are no completed matches in a
call to <b>pcre_dfa_exec()</b>. This means that as soon as the shortest match has
been found, continuation to a new subject segment is no longer possible.
Consider this <b>pcretest</b> example:
<pre>
re&#62; /dog(sbody)?/
data&#62; do\P\D
Partial match: do
data&#62; gsb\R\P\D
0: g
data&#62; dogsbody\D
0: dogsbody
1: dog
</pre>
The pattern matches the words "dog" or "dogsbody". When the subject is
presented in several parts ("do" and "gsb" being the first two) the match stops
when "dog" has been found, and it is not possible to continue. On the other
hand, if "dogsbody" is presented as a single string, both matches are found.
</P>
<P>
Because of this phenomenon, it does not usually make sense to end a pattern
that is going to be matched in this way with a variable repeat.
</P>
<P>
4. Patterns that contain alternatives at the top level which do not all
start with the same pattern item may not work as expected. For example,
consider this pattern:
<pre>
1234|3789
</pre>
If the first part of the subject is "ABC123", a partial match of the first
alternative is found at offset 3. There is no partial match for the second
alternative, because such a match does not start at the same point in the
subject string. Attempting to continue with the string "789" does not yield a
match because only those alternatives that match at one point in the subject
are remembered. The problem arises because the start of the second alternative
matches within the first alternative. There is no problem with anchored
patterns or patterns such as:
<pre>
1234|ABCD
</pre>
where no string can be a partial match for both alternatives.
</P>
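<P>
The buffer arithmetic from point 2, together with the option choices from
point 1, can be worked out per chunk before any matching call is made. The
following sketch does only that calculation and does not call PCRE. The
structure <b>chunk_plan</b> and the function <b>plan_chunk()</b> are
hypothetical. The sketch assumes that the complete subject starts and ends on
a line boundary, that chunk boundaries do not, and that every chunk except the
last should allow a partial match.
</P>

```c
#include <stddef.h>

/* Hypothetical description of one call in a multi-segment match. */
struct chunk_plan {
  size_t buf_src;       /* offset in the full subject where the buffer begins */
  size_t buf_len;       /* number of bytes to place in the buffer */
  size_t start_offset;  /* starting offset to pass, relative to the buffer */
  int restart;          /* nonzero: pass PCRE_DFA_RESTART */
  int partial;          /* nonzero: pass PCRE_PARTIAL */
  int notbol;           /* nonzero: pass PCRE_NOTBOL */
  int noteol;           /* nonzero: pass PCRE_NOTEOL */
};

/* Plan chunk number index (counting from 0) of a subject of total bytes,
   cut into pieces of chunk bytes, each preceded by up to overlap bytes
   of the previous data. Returns 0 on success, or -1 if there is no such
   chunk. */
int plan_chunk(size_t total, size_t chunk, size_t overlap, size_t index,
               struct chunk_plan *out)
{
  size_t begin, end, back;

  if (chunk == 0 || index > total / chunk) return -1;
  begin = index * chunk;               /* cannot overflow after the check */
  if (begin >= total) return -1;
  end = (total - begin > chunk) ? begin + chunk : total;
  back = (begin < overlap) ? begin : overlap;

  out->buf_src = begin - back;
  out->buf_len = back + (end - begin);
  out->start_offset = back;
  out->restart = (index > 0);
  out->partial = (end < total);
  out->notbol = (index > 0);           /* buffer does not begin a line */
  out->noteol = (end < total);         /* buffer does not end a line */
  return 0;
}
```

<P>
With 500-byte chunks and a 200-byte overlap, the second chunk of a 1200-byte
subject is planned as a 700-byte buffer with a starting offset of 200, which
is the layout described in point 2.
</P>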
<P>
Last updated: 16 January 2006
<br>
Copyright &copy; 1997-2006 University of Cambridge.
<p>
Return to the <a href="index.html">PCRE index page</a>.
</p>
