Contents of /code/trunk/doc/pcrepartial.3

Revision 172
Tue Jun 5 10:40:13 2007 UTC by ph10
File size: 8657 byte(s)
Drastically reduce workspace used for alternatives in groups; also some 
trailing space removals for a test release.

.TH PCREPARTIAL 3
.SH NAME
PCRE - Perl-compatible regular expressions
.SH "PARTIAL MATCHING IN PCRE"
.rs
.sp
In normal use of PCRE, if the subject string that is passed to
\fBpcre_exec()\fP or \fBpcre_dfa_exec()\fP matches as far as it goes, but is
too short to match the entire pattern, PCRE_ERROR_NOMATCH is returned. There
are circumstances where it might be helpful to distinguish this case from other
cases in which there is no match.
.P
Consider, for example, an application where a human is required to type in data
for a field with specific formatting requirements. An example might be a date
in the form \fIddmmmyy\fP, defined by this pattern:
.sp
  ^\ed?\ed(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\ed\ed$
.sp
If the application sees the user's keystrokes one by one, and can check that
what has been typed so far is potentially valid, it is able to raise an error
as soon as a mistake is made, possibly beeping and not reflecting the
character that has been typed. This immediate feedback is likely to be a better
user interface than a check that is delayed until the entire string has been
entered.
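.P
The keystroke-by-keystroke check described above can be sketched without PCRE
at all. The following C fragment is a hand-written illustration and not part of
the library: it classifies what has been typed so far against the
\fIddmmmyy\fP form as a complete match, a possible prefix, or a mistake. The
names classify_date() and classify_with_day_len() and the three result
constants are hypothetical, and the sketch treats an empty string as a possible
prefix, which is an assumption rather than documented PCRE behaviour.
.sp
.nf
```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical result codes for this sketch; they are not PCRE values. */
enum { NO_MATCH = -1, PARTIAL_MATCH = 0, FULL_MATCH = 1 };

static const char *const month_names[12] = {
  "jan", "feb", "mar", "apr", "may", "jun",
  "jul", "aug", "sep", "oct", "nov", "dec"
};

/* Classify s on the assumption that the day has exactly day_len digits. */
static int classify_with_day_len(const char *s, size_t day_len)
{
  size_t len = strlen(s);
  size_t i, m, rest, year_len;

  /* Day: every typed character in this range must be a digit. */
  for (i = 0; i < day_len; i++) {
    if (i == len) return PARTIAL_MATCH;      /* ran out of input */
    if (!isdigit((unsigned char)s[i])) return NO_MATCH;
  }
  if (len == day_len) return PARTIAL_MATCH;

  /* Month: compare as many characters as have been typed, up to three. */
  rest = len - day_len;
  for (m = 0; m < 12; m++) {
    if (strncmp(s + day_len, month_names[m], rest < 3 ? rest : 3) != 0)
      continue;
    if (rest < 3) return PARTIAL_MATCH;      /* month still being typed */

    /* Year: at most two digits may follow the month. */
    year_len = rest - 3;
    if (year_len > 2) return NO_MATCH;
    for (i = 0; i < year_len; i++)
      if (!isdigit((unsigned char)s[day_len + 3 + i])) return NO_MATCH;
    return year_len == 2 ? FULL_MATCH : PARTIAL_MATCH;
  }
  return NO_MATCH;
}

/* Try a one-digit and a two-digit day and keep the better outcome. */
static int classify_date(const char *s)
{
  int one, two;
  assert(s != NULL);
  one = classify_with_day_len(s, 1);
  two = classify_with_day_len(s, 2);
  return one > two ? one : two;
}
```
.fi
.P
An application along these lines would call classify_date() after each
keystroke and reject the character when the result is NO_MATCH. A real program
would more likely hand the same decision to PCRE; the sketch only shows the
three outcomes that the application needs to distinguish.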
.P
PCRE supports the concept of partial matching by means of the PCRE_PARTIAL
option, which can be set when calling \fBpcre_exec()\fP or
\fBpcre_dfa_exec()\fP. When this flag is set for \fBpcre_exec()\fP, the return
code PCRE_ERROR_NOMATCH is converted into PCRE_ERROR_PARTIAL if at any time
during the matching process the last part of the subject string matched part of
the pattern. Unfortunately, for non-anchored matching, it is not possible to
obtain the position of the start of the partial match. No captured data is set
when PCRE_ERROR_PARTIAL is returned.
.P
When PCRE_PARTIAL is set for \fBpcre_dfa_exec()\fP, the return code
PCRE_ERROR_NOMATCH is converted into PCRE_ERROR_PARTIAL if the end of the
subject is reached, there have been no complete matches, but there is still at
least one matching possibility. The portion of the string that provided the
partial match is set as the first matching string.
.P
Using PCRE_PARTIAL disables one of PCRE's optimizations. PCRE remembers the
last literal byte in a pattern, and abandons matching immediately if such a
byte is not present in the subject string. This optimization cannot be used
for a subject string that might match only partially.
.
.
.SH "RESTRICTED PATTERNS FOR PCRE_PARTIAL"
.rs
.sp
Because of the way certain internal optimizations are implemented in the
\fBpcre_exec()\fP function, the PCRE_PARTIAL option cannot be used with all
patterns. These restrictions do not apply when \fBpcre_dfa_exec()\fP is used.
For \fBpcre_exec()\fP, repeated single characters such as
.sp
  a{2,4}
.sp
and repeated single metasequences such as
.sp
  \ed+
.sp
are not permitted if the maximum number of occurrences is greater than one.
Optional items such as \ed? (where the maximum is one) are permitted.
Quantifiers with any values are permitted after parentheses, so the invalid
examples above can be coded thus:
.sp
  (a){2,4}
  (\ed)+
.sp
These constructions run more slowly, but for the kinds of application that are
envisaged for this facility, this is not felt to be a major restriction.
.P
If PCRE_PARTIAL is set for a pattern that does not conform to the restrictions,
\fBpcre_exec()\fP returns the error code PCRE_ERROR_BADPARTIAL (-13).
You can use the PCRE_INFO_OKPARTIAL call to \fBpcre_fullinfo()\fP to find out
if a compiled pattern can be used for partial matching.
.
.
.SH "EXAMPLE OF PARTIAL MATCHING USING PCRETEST"
.rs
.sp
If the escape sequence \eP is present in a \fBpcretest\fP data line, the
PCRE_PARTIAL flag is used for the match. Here is a run of \fBpcretest\fP that
uses the date example quoted above:
.sp
    re> /^\ed?\ed(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\ed\ed$/
  data> 25jun04\eP
   0: 25jun04
   1: jun
  data> 25dec3\eP
  Partial match
  data> 3ju\eP
  Partial match
  data> 3juj\eP
  No match
  data> j\eP
  No match
.sp
The first data string is matched completely, so \fBpcretest\fP shows the
matched substrings. The remaining four strings do not match the complete
pattern, but the first two are partial matches. The same test, using
\fBpcre_dfa_exec()\fP matching (by means of the \eD escape sequence), produces
the following output:
.sp
    re> /^\ed?\ed(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\ed\ed$/
  data> 25jun04\eP\eD
   0: 25jun04
  data> 23dec3\eP\eD
  Partial match: 23dec3
  data> 3ju\eP\eD
  Partial match: 3ju
  data> 3juj\eP\eD
  No match
  data> j\eP\eD
  No match
.sp
Notice that in this case the portion of the string that was matched is made
available.
.
.
.SH "MULTI-SEGMENT MATCHING WITH pcre_dfa_exec()"
.rs
.sp
When a partial match has been found using \fBpcre_dfa_exec()\fP, it is possible
to continue the match by providing additional subject data and calling
\fBpcre_dfa_exec()\fP again with the same compiled regular expression, this
time setting the PCRE_DFA_RESTART option. You must also pass the same working
space as before, because this is where details of the previous partial match
are stored. Here is an example using \fBpcretest\fP, using the \eR escape
sequence to set the PCRE_DFA_RESTART option (\eP and \eD are as above):
.sp
    re> /^\ed?\ed(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\ed\ed$/
  data> 23ja\eP\eD
  Partial match: 23ja
  data> n05\eR\eD
   0: n05
.sp
The first call has "23ja" as the subject, and requests partial matching; the
second call has "n05" as the subject for the continued (restarted) match.
Notice that when the match is complete, only the last part is shown; PCRE does
not retain the previously partially-matched string. It is up to the calling
program to do that if it needs to.
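.P
Since PCRE does not keep the earlier segments, a caller that wants the whole
matched text has to collect it itself. The following C fragment is one possible
way to do that, offered only as a sketch: the structure matched_text, its fixed
256-byte size, and the functions matched_reset() and matched_append() are all
hypothetical names chosen for illustration, and no PCRE calls are made.
.sp
.nf
```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical fixed-size accumulator for the text matched so far. */
struct matched_text {
  char text[256];
  size_t length;
};

/* Forget any text collected for a previous match. */
static void matched_reset(struct matched_text *m)
{
  assert(m != NULL);
  m->length = 0;
  m->text[0] = 0;
}

/* Append n bytes of a segment; returns 0 on success, -1 if they do not fit. */
static int matched_append(struct matched_text *m, const char *s, size_t n)
{
  assert(m != NULL && s != NULL);
  if (n >= sizeof m->text - m->length) return -1;  /* keep room for the 0 */
  memcpy(m->text + m->length, s, n);
  m->length += n;
  m->text[m->length] = 0;
  return 0;
}
```
.fi
.P
In the example above, the caller would append "23ja" after the partial match
and "n05" after the restarted match, leaving "23jan05" in the accumulator.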
.P
You can set PCRE_PARTIAL with PCRE_DFA_RESTART to continue partial matching
over multiple segments. This facility can be used to pass very long subject
strings to \fBpcre_dfa_exec()\fP. However, some care is needed for certain
types of pattern.
.P
1. If the pattern contains tests for the beginning or end of a line, you need
to pass the PCRE_NOTBOL or PCRE_NOTEOL options, as appropriate, when the
subject string for any call does not contain the beginning or end of a line.
.P
2. If the pattern contains backward assertions (including \eb or \eB), you need
to arrange for some overlap in the subject strings to allow for this. For
example, you could pass the subject in chunks that are 500 bytes long, but in
a buffer of 700 bytes, with the starting offset set to 200 and the previous 200
bytes at the start of the buffer.
.P
3. Matching a subject string that is split into multiple segments does not
always produce exactly the same result as matching over one single long string.
The difference arises when there are multiple matching possibilities, because a
partial match result is given only when there are no completed matches in a
call to \fBpcre_dfa_exec()\fP. This means that as soon as the shortest match has
been found, continuation to a new subject segment is no longer possible.
Consider this \fBpcretest\fP example:
.sp
    re> /dog(sbody)?/
  data> do\eP\eD
  Partial match: do
  data> gsb\eR\eP\eD
   0: g
  data> dogsbody\eD
   0: dogsbody
   1: dog
.sp
The pattern matches the words "dog" or "dogsbody". When the subject is
presented in several parts ("do" and "gsb" being the first two) the match stops
when "dog" has been found, and it is not possible to continue. On the other
hand, if "dogsbody" is presented as a single string, both matches are found.
.P
Because of this phenomenon, it does not usually make sense to end a pattern
that is going to be matched in this way with a variable repeat.
.P
4. Patterns that contain alternatives at the top level which do not all
start with the same pattern item may not work as expected. For example,
consider this pattern:
.sp
  1234|3789
.sp
If the first part of the subject is "ABC123", a partial match of the first
alternative is found at offset 3. There is no partial match for the second
alternative, because such a match does not start at the same point in the
subject string. Attempting to continue with the string "789" does not yield a
match because only those alternatives that match at one point in the subject
are remembered. The problem arises because the start of the second alternative
matches within the first alternative. There is no problem with anchored
patterns or patterns such as:
.sp
  1234|ABCD
.sp
where no string can be a partial match for both alternatives.
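.P
The overlap arrangement suggested in item 2 above (500-byte chunks in a
700-byte buffer, with the previous 200 bytes kept at the front) can be sketched
in C as follows. This is an illustration of the buffer bookkeeping only and
makes no PCRE calls; the structure window and the functions window_init() and
window_load() are hypothetical names. After each load, the fields buf, length
and start are intended to be passed as the subject, length and starting offset
arguments of \fBpcre_dfa_exec()\fP.
.sp
.nf
```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Sizes taken from the example in the text. */
enum { OVERLAP = 200, CHUNK = 500 };

/* Hypothetical sliding window over a long subject. */
struct window {
  char buf[OVERLAP + CHUNK];
  size_t start;    /* offset at which matching should start */
  size_t length;   /* number of valid bytes in buf */
};

/* Start with an empty window. */
static void window_init(struct window *w)
{
  assert(w != NULL);
  w->start = 0;
  w->length = 0;
}

/*
 * Load the next chunk of at most CHUNK bytes. The tail of the previous
 * contents, up to OVERLAP bytes, is moved to the front of the buffer so
 * that backward assertions can look at it.
 */
static void window_load(struct window *w, const char *chunk, size_t n)
{
  size_t keep;
  assert(w != NULL && chunk != NULL && n <= CHUNK);
  keep = w->length < OVERLAP ? w->length : OVERLAP;
  memmove(w->buf, w->buf + (w->length - keep), keep);
  memcpy(w->buf + keep, chunk, n);
  w->start = keep;
  w->length = keep + n;
}
```
.fi
.P
With these sizes, the first load leaves the starting offset at 0, and every
load that follows a full 500-byte chunk leaves it at 200 with 700 valid bytes
in the buffer.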
.
.
.SH AUTHOR
.rs
.sp
.nf
Philip Hazel
University Computing Service
Cambridge CB2 3QH, England.
.fi
.
.
.SH REVISION
.rs
.sp
.nf
Last updated: 04 June 2007
Copyright (c) 1997-2007 University of Cambridge.
.fi

