<html>
<head>
<title>pcrepartial specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcrepartial man page</h1>
<p>
Return to the <a href="index.html">PCRE index page</a>.
</p>
<p>
This page is part of the PCRE HTML documentation. It was generated automatically
from the original man page. If there is any nonsense in it, please consult the
man page, in case the conversion went wrong.
<br>
<ul>
<li><a name="TOC1" href="#SEC1">PARTIAL MATCHING IN PCRE</a>
<li><a name="TOC2" href="#SEC2">RESTRICTED PATTERNS FOR PCRE_PARTIAL</a>
<li><a name="TOC3" href="#SEC3">EXAMPLE OF PARTIAL MATCHING USING PCRETEST</a>
<li><a name="TOC4" href="#SEC4">MULTI-SEGMENT MATCHING WITH pcre_dfa_exec()</a>
</ul>
<br><a name="SEC1" href="#TOC1">PARTIAL MATCHING IN PCRE</a><br>
<P>
In normal use of PCRE, if the subject string that is passed to
<b>pcre_exec()</b> or <b>pcre_dfa_exec()</b> matches as far as it goes, but is
too short to match the entire pattern, PCRE_ERROR_NOMATCH is returned. There
are circumstances where it might be helpful to distinguish this case from other
cases in which there is no match.
</P>
<P>
Consider, for example, an application where a human is required to type in data
for a field with specific formatting requirements. An example might be a date
in the form <i>ddmmmyy</i>, defined by this pattern:
<pre>
^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$
</pre>
If the application sees the user's keystrokes one by one, and can check that
what has been typed so far is potentially valid, it is able to raise an error
as soon as a mistake is made, possibly beeping and not reflecting the
character that has been typed. This immediate feedback is likely to be a better
user interface than a check that is delayed until the entire string has been
entered.
</P>
<P>
PCRE supports the concept of partial matching by means of the PCRE_PARTIAL
option, which can be set when calling <b>pcre_exec()</b> or
<b>pcre_dfa_exec()</b>. When this flag is set for <b>pcre_exec()</b>, the return
code PCRE_ERROR_NOMATCH is converted into PCRE_ERROR_PARTIAL if at any time
during the matching process the last part of the subject string matched part of
the pattern. Unfortunately, for non-anchored matching, it is not possible to
obtain the position of the start of the partial match. No captured data is set
when PCRE_ERROR_PARTIAL is returned.
</P>
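<P>
The three possible outcomes (complete match, partial match, no match) can be
illustrated without calling PCRE at all. The following C sketch is a
hand-written checker for the date pattern above; it is not PCRE's own
algorithm. The names <b>check_date()</b>, <b>try_layout()</b> and the
<b>enum result</b> values are invented for this illustration, and treating an
empty input as a partial match is an assumption of the sketch.
</P>

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Ordered so that a "better" outcome compares greater. */
enum result { NO_MATCH, PARTIAL_MATCH, FULL_MATCH };

static const char *const months[12] = {
    "jan", "feb", "mar", "apr", "may", "jun",
    "jul", "aug", "sep", "oct", "nov", "dec"
};

/* Is the n-character string at s (n <= 3) a prefix of some month name? */
static int month_prefix(const char *s, size_t n)
{
    for (size_t i = 0; i < 12; i++)
        if (strncmp(s, months[i], n) == 0)
            return 1;
    return 0;
}

/* Check s against one layout: ndigits day digits, 3 month letters,
   2 year digits. Handles the \d?\d alternative one branch at a time. */
static enum result try_layout(const char *s, size_t len, size_t ndigits)
{
    assert(ndigits == 1 || ndigits == 2);
    size_t total = ndigits + 5;
    if (len > total)
        return NO_MATCH;                 /* too long for this layout */
    for (size_t i = 0; i < len; i++) {
        int in_month = (i >= ndigits && i < ndigits + 3);
        if (!in_month && !isdigit((unsigned char)s[i]))
            return NO_MATCH;             /* digit expected here */
    }
    if (len > ndigits) {
        size_t n = len - ndigits;
        if (n > 3)
            n = 3;
        if (!month_prefix(s + ndigits, n))
            return NO_MATCH;             /* not the start of a month name */
    }
    return len == total ? FULL_MATCH : PARTIAL_MATCH;
}

/* Classify what has been typed so far. */
enum result check_date(const char *s)
{
    size_t len = strlen(s);
    enum result best = NO_MATCH;
    for (size_t nd = 1; nd <= 2; nd++) {
        enum result r = try_layout(s, len, nd);
        if (r > best)
            best = r;
    }
    return best;
}
```

<P>
An input handler could call <b>check_date()</b> after each keystroke, refusing
the character when NO_MATCH comes back and accepting the field once FULL_MATCH
is returned.
</P>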
<P>
When PCRE_PARTIAL is set for <b>pcre_dfa_exec()</b>, the return code
PCRE_ERROR_NOMATCH is converted into PCRE_ERROR_PARTIAL if the end of the
subject is reached, there have been no complete matches, but there is still at
least one matching possibility. The portion of the string that provided the
partial match is set as the first matching string.
</P>
<P>
Using PCRE_PARTIAL disables one of PCRE's optimizations. PCRE remembers the
last literal byte in a pattern, and abandons matching immediately if such a
byte is not present in the subject string. This optimization cannot be used
for a subject string that might match only partially.
</P>
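<P>
The reason the optimization has to be switched off can be seen in a small C
sketch. This is only an illustration of the idea, not PCRE's source code: the
function <b>can_abandon_early()</b> and its parameters are hypothetical, and
the sketch assumes that the remembered byte is one that every complete match
must contain.
</P>

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Returns 1 if matching can be given up before it starts, 0 otherwise.
   required_byte stands for the literal byte remembered from the pattern. */
int can_abandon_early(const char *subject, size_t length,
                      unsigned char required_byte, int partial_requested)
{
    assert(subject != NULL);
    if (partial_requested)
        return 0;   /* the byte may still arrive in input not yet seen */
    if (length == 0)
        return 1;   /* an empty subject cannot contain the byte */
    return memchr(subject, required_byte, length) == NULL;
}
```

<P>
For a hypothetical pattern such as abc\d\dz, whose last literal byte is "z",
the subject "abc12" contains no "z". Without partial matching it can be
rejected at once, but with partial matching it is a valid beginning and must be
examined.
</P>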
<br><a name="SEC2" href="#TOC1">RESTRICTED PATTERNS FOR PCRE_PARTIAL</a><br>
<P>
Because of the way certain internal optimizations are implemented in the
<b>pcre_exec()</b> function, the PCRE_PARTIAL option cannot be used with all
patterns. These restrictions do not apply when <b>pcre_dfa_exec()</b> is used.
For <b>pcre_exec()</b>, repeated single characters such as
<pre>
a{2,4}
</pre>
and repeated single metasequences such as
<pre>
\d+
</pre>
are not permitted if the maximum number of occurrences is greater than one.
Optional items such as \d? (where the maximum is one) are permitted.
Quantifiers with any values are permitted after parentheses, so the invalid
examples above can be coded thus:
<pre>
(a){2,4}
(\d)+
</pre>
These constructions run more slowly, but for the kinds of application that are
envisaged for this facility, this is not felt to be a major restriction.
</P>
<P>
If PCRE_PARTIAL is set for a pattern that does not conform to the restrictions,
<b>pcre_exec()</b> returns the error code PCRE_ERROR_BADPARTIAL (-13).
</P>
<br><a name="SEC3" href="#TOC1">EXAMPLE OF PARTIAL MATCHING USING PCRETEST</a><br>
<P>
If the escape sequence \P is present in a <b>pcretest</b> data line, the
PCRE_PARTIAL flag is used for the match. Here is a run of <b>pcretest</b> that
uses the date example quoted above:
<pre>
re&#62; /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
data&#62; 25jun04\P
0: 25jun04
1: jun
data&#62; 25dec3\P
Partial match
data&#62; 3ju\P
Partial match
data&#62; 3juj\P
No match
data&#62; j\P
No match
</pre>
The first data string is matched completely, so <b>pcretest</b> shows the
matched substrings. The remaining four strings do not match the complete
pattern, but the first two are partial matches. The same test, using
<b>pcre_dfa_exec()</b> matching (by means of the \D escape sequence), produces
the following output:
<pre>
re&#62; /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
data&#62; 25jun04\P\D
0: 25jun04
data&#62; 23dec3\P\D
Partial match: 23dec3
data&#62; 3ju\P\D
Partial match: 3ju
data&#62; 3juj\P\D
No match
data&#62; j\P\D
No match
</pre>
Notice that in this case the portion of the string that was matched is made
available.
</P>
<br><a name="SEC4" href="#TOC1">MULTI-SEGMENT MATCHING WITH pcre_dfa_exec()</a><br>
<P>
When a partial match has been found using <b>pcre_dfa_exec()</b>, it is possible
to continue the match by providing additional subject data and calling
<b>pcre_dfa_exec()</b> again with the same compiled regular expression, this
time setting the PCRE_DFA_RESTART option. You must also pass the same working
space as before, because this is where details of the previous partial match
are stored. Here is a <b>pcretest</b> example in which the \R escape
sequence sets the PCRE_DFA_RESTART option (\P and \D are as above):
<pre>
re&#62; /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
data&#62; 23ja\P\D
Partial match: 23ja
data&#62; n05\R\D
0: n05
</pre>
The first call has "23ja" as the subject, and requests partial matching; the
second call has "n05" as the subject for the continued (restarted) match.
Notice that when the match is complete, only the last part is shown; PCRE does
not retain the previously partially-matched string. It is up to the calling
program to do that if it needs to.
</P>
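<P>
The role of the working space in a restarted match can be pictured with a toy
matcher. The C sketch below handles only an anchored literal string and does
not use PCRE; <b>struct workspace</b>, <b>ws_init()</b> and <b>feed()</b> are
names invented here. The caller-owned structure plays the part of the working
space: it records how far earlier segments got, but not their text.
</P>

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum status { ST_NOMATCH, ST_PARTIAL, ST_MATCH };

/* Caller-owned state, standing in for the working space vector. */
struct workspace {
    const char *literal;   /* the whole string that must be matched */
    size_t      matched;   /* characters consumed by earlier segments */
};

void ws_init(struct workspace *ws, const char *literal)
{
    ws->literal = literal;
    ws->matched = 0;
}

/* Match one segment. With restart == 0 the match begins afresh; with
   restart != 0 it continues from the progress stored in ws. */
enum status feed(struct workspace *ws, const char *segment, int restart)
{
    assert(ws != NULL && ws->literal != NULL && segment != NULL);
    if (!restart)
        ws->matched = 0;
    size_t total = strlen(ws->literal);
    size_t need = total - ws->matched;
    size_t n = strlen(segment);
    if (n > need || strncmp(segment, ws->literal + ws->matched, n) != 0) {
        ws->matched = 0;           /* nothing left to continue from */
        return ST_NOMATCH;
    }
    ws->matched += n;              /* remember progress for a restart */
    return ws->matched == total ? ST_MATCH : ST_PARTIAL;
}
```

<P>
With the literal "23jan05", feeding "23ja" gives a partial result and feeding
"n05" as a restart completes the match. As with PCRE, the structure holds no
copy of "23ja"; a caller that needs the full matched text has to keep the
earlier segments itself.
</P>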
<P>
You can set PCRE_PARTIAL with PCRE_DFA_RESTART to continue partial matching
over multiple segments. This facility can be used to pass very long subject
strings to <b>pcre_dfa_exec()</b>. However, some care is needed for certain
types of pattern.
</P>
<P>
1. If the pattern contains tests for the beginning or end of a line, you need
to pass the PCRE_NOTBOL or PCRE_NOTEOL options, as appropriate, when the
subject string for any call does not contain the beginning or end of a line.
</P>
<P>
2. If the pattern contains backward assertions (including \b or \B), you need
to arrange for some overlap in the subject strings to allow for this. For
example, you could pass the subject in chunks that are 500 bytes long, but in
a buffer of 700 bytes, with the starting offset set to 200 and the previous 200
bytes at the start of the buffer.
</P>
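<P>
The buffer arrangement described in point 2 might be set up as in the
following C sketch. It only prepares the buffer and does not call PCRE;
<b>struct window</b> and <b>load_window()</b> are hypothetical names, and the
sizes are the 500 and 200 bytes used in the example above.
</P>

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define CHUNK   500   /* new bytes presented on each call */
#define OVERLAP 200   /* bytes of earlier data kept for backward assertions */

struct window {
    char   buf[OVERLAP + CHUNK];  /* the 700-byte buffer */
    size_t length;                /* number of valid bytes in buf */
    size_t start_offset;          /* where matching should begin in buf */
};

/* Fill w for the chunk that starts at position pos of data, which is
   total bytes long. Returns the number of new bytes placed in the window. */
size_t load_window(struct window *w, const char *data, size_t total, size_t pos)
{
    assert(w != NULL && data != NULL && pos <= total);
    /* The first chunk has nothing before it; later chunks keep OVERLAP. */
    size_t back = pos < OVERLAP ? pos : OVERLAP;
    size_t fresh = total - pos < CHUNK ? total - pos : CHUNK;
    memcpy(w->buf, data + pos - back, back + fresh);
    w->start_offset = back;
    w->length = back + fresh;
    return fresh;
}
```

<P>
After each call, <b>buf</b>, <b>length</b> and <b>start_offset</b> would be
passed as the subject, length and starting offset arguments of
<b>pcre_dfa_exec()</b>, so that backward assertions can inspect the bytes that
precede the starting offset.
</P>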
<P>
3. Matching a subject string that is split into multiple segments does not
always produce exactly the same result as matching over one single long string.
The difference arises when there are multiple matching possibilities, because a
partial match result is given only when there are no completed matches in a
call to <b>pcre_dfa_exec()</b>. This means that as soon as the shortest match has
been found, continuation to a new subject segment is no longer possible.
Consider this <b>pcretest</b> example:
<pre>
re&#62; /dog(sbody)?/
data&#62; do\P\D
Partial match: do
data&#62; gsb\R\P\D
0: g
data&#62; dogsbody\D
0: dogsbody
1: dog
</pre>
The pattern matches the words "dog" or "dogsbody". When the subject is
presented in several parts ("do" and "gsb" being the first two) the match stops
when "dog" has been found, and it is not possible to continue. On the other
hand, if "dogsbody" is presented as a single string, both matches are found.
</P>
<P>
Because of this phenomenon, it does not usually make sense to end a pattern
that is going to be matched in this way with a variable repeat.
</P>
<P>
4. Patterns that contain alternatives at the top level which do not all
start with the same pattern item may not work as expected. For example,
consider this pattern:
<pre>
1234|3789
</pre>
If the first part of the subject is "ABC123", a partial match of the first
alternative is found at offset 3. There is no partial match for the second
alternative, because such a match does not start at the same point in the
subject string. Attempting to continue with the string "789" does not yield a
match because only those alternatives that match at one point in the subject
are remembered. The problem arises because the start of the second alternative
matches within the first alternative. There is no problem with anchored
patterns or patterns such as:
<pre>
1234|ABCD
</pre>
where no string can be a partial match for both alternatives.
</P>
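<P>
The differing start points in point 4 can be checked with a short C sketch.
It treats each alternative as a plain literal and does not use PCRE;
<b>partial_start()</b> is a name invented for this illustration.
</P>

```c
#include <assert.h>
#include <string.h>

/* Returns the lowest offset in subject from which the remainder of the
   subject is a prefix of literal, or -1 if there is no such offset. */
long partial_start(const char *subject, const char *literal)
{
    assert(subject != NULL && literal != NULL);
    size_t slen = strlen(subject);
    size_t llen = strlen(literal);
    for (size_t off = 0; off < slen; off++) {
        size_t rest = slen - off;   /* bytes from off to end of subject */
        if (rest <= llen && strncmp(subject + off, literal, rest) == 0)
            return (long)off;
    }
    return -1;
}
```

<P>
For the subject "ABC123", the literal "1234" begins a partial match at offset
3, whereas "3789" could only begin one at offset 5. Because only the
possibilities that start at one point are remembered, the second alternative is
lost when the match is continued with "789".
</P>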
<P>
Last updated: 30 November 2006
<br>
Copyright &copy; 1997-2006 University of Cambridge.
<p>
Return to the <a href="index.html">PCRE index page</a>.
</p>
