.TH PCRE 3
.SH NAME
PCRE - Perl-compatible regular expressions
.SH "PARTIAL MATCHING IN PCRE"
.rs
.sp
In normal use of PCRE, if the subject string that is passed to
\fBpcre_exec()\fP or \fBpcre_dfa_exec()\fP matches as far as it goes, but is
too short to match the entire pattern, PCRE_ERROR_NOMATCH is returned. There
are circumstances where it might be helpful to distinguish this case from other
cases in which there is no match.
.P
Consider, for example, an application where a human is required to type in data
for a field with specific formatting requirements. An example might be a date
in the form \fIddmmmyy\fP, defined by this pattern:
.sp
  ^\ed?\ed(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\ed\ed$
.sp
If the application sees the user's keystrokes one by one, and can check that
what has been typed so far is potentially valid, it is able to raise an error
as soon as a mistake is made, possibly beeping and not reflecting the
character that has been typed. This immediate feedback is likely to be a better
user interface than a check that is delayed until the entire string has been
entered.
.P
PCRE supports the concept of partial matching by means of the PCRE_PARTIAL
option, which can be set when calling \fBpcre_exec()\fP or
\fBpcre_dfa_exec()\fP. When this flag is set for \fBpcre_exec()\fP, the return
code PCRE_ERROR_NOMATCH is converted into PCRE_ERROR_PARTIAL if at any time
during the matching process the last part of the subject string matched part of
the pattern. Unfortunately, for non-anchored matching, it is not possible to
obtain the position of the start of the partial match. No captured data is set
when PCRE_ERROR_PARTIAL is returned.
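.P
The three possible outcomes (complete match, partial match, no match) can be
modelled outside PCRE. The following Python sketch is an illustration only:
PCRE is a C library, and the names classify(), templates() and fits() are
hypothetical helpers, not part of its API. The sketch hard-codes the shapes
that the date pattern above can take instead of using a regular expression
engine, and it treats an empty subject as no match, which is a modelling
assumption.
.sp
.nf
```python
# Illustrative model of match / partial match / no match for the
# ddmmmyy date pattern. Not PCRE code; all names are hypothetical.
MONTHS = ("jan", "feb", "mar", "apr", "may", "jun",
          "jul", "aug", "sep", "oct", "nov", "dec")
DIGITS = "0123456789"


def templates():
    """Yield every shape a complete date can take ('#' stands for a digit)."""
    for day_digits in (1, 2):
        for month in MONTHS:
            yield "#" * day_digits + month + "##"


def fits(ch, slot):
    """Return True if character ch is acceptable at template position slot."""
    return ch in DIGITS if slot == "#" else ch == slot


def classify(subject):
    """Return 'match', 'partial' or 'no match' for the text typed so far."""
    result = "no match"
    for template in templates():
        if len(subject) > len(template):
            continue                      # too long for this shape
        if all(fits(c, t) for c, t in zip(subject, template)):
            if len(subject) == len(template):
                return "match"            # whole template consumed
            if subject:                   # empty input is not reported as partial
                result = "partial"        # a proper prefix of some complete date
    return result
```
.fi
.sp
Under these assumptions, classify("25dec3") and classify("3ju") return
"partial", whereas classify("3juj") and classify("j") return "no match". An
application checking keystrokes would reject the input only in the last case.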
.P
When PCRE_PARTIAL is set for \fBpcre_dfa_exec()\fP, the return code
PCRE_ERROR_NOMATCH is converted into PCRE_ERROR_PARTIAL if the end of the
subject is reached, there have been no complete matches, but there is still at
least one matching possibility. The portion of the string that provided the
partial match is set as the first matching string.
.P
Using PCRE_PARTIAL disables one of PCRE's optimizations. PCRE remembers the
last literal byte in a pattern, and abandons matching immediately if such a
byte is not present in the subject string. This optimization cannot be used
for a subject string that might match only partially.
.
.
.SH "RESTRICTED PATTERNS FOR PCRE_PARTIAL"
.rs
.sp
Because of the way certain internal optimizations are implemented in the
\fBpcre_exec()\fP function, the PCRE_PARTIAL option cannot be used with all
patterns. These restrictions do not apply when \fBpcre_dfa_exec()\fP is used.
For \fBpcre_exec()\fP, repeated single characters such as
.sp
  a{2,4}
.sp
and repeated single metasequences such as
.sp
  \ed+
.sp
are not permitted if the maximum number of occurrences is greater than one.
Optional items such as \ed? (where the maximum is one) are permitted.
Quantifiers with any values are permitted after parentheses, so the invalid
examples above can be coded thus:
.sp
  (a){2,4}
  (\ed)+
.sp
These constructions run more slowly, but for the kinds of application that are
envisaged for this facility, this is not felt to be a major restriction.
.P
If PCRE_PARTIAL is set for a pattern that does not conform to the restrictions,
\fBpcre_exec()\fP returns the error code PCRE_ERROR_BADPARTIAL (-13).
.
.
.SH "EXAMPLE OF PARTIAL MATCHING USING PCRETEST"
.rs
.sp
If the escape sequence \eP is present in a \fBpcretest\fP data line, the
PCRE_PARTIAL flag is used for the match. Here is a run of \fBpcretest\fP that
uses the date example quoted above:
.sp
    re> /^\ed?\ed(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\ed\ed$/
  data> 25jun04\eP
   0: 25jun04
   1: jun
  data> 25dec3\eP
  Partial match
  data> 3ju\eP
  Partial match
  data> 3juj\eP
  No match
  data> j\eP
  No match
.sp
The first data string is matched completely, so \fBpcretest\fP shows the
matched substrings. The remaining four strings do not match the complete
pattern, but the first two are partial matches. The same test, using DFA
matching (by means of the \eD escape sequence), produces the following output:
.sp
    re> /^\ed?\ed(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\ed\ed$/
  data> 25jun04\eP\eD
   0: 25jun04
  data> 23dec3\eP\eD
  Partial match: 23dec3
  data> 3ju\eP\eD
  Partial match: 3ju
  data> 3juj\eP\eD
  No match
  data> j\eP\eD
  No match
.sp
Notice that in this case the portion of the string that was matched is made
available.
.
.
.SH "MULTI-SEGMENT MATCHING WITH pcre_dfa_exec()"
.rs
.sp
When a partial match has been found using \fBpcre_dfa_exec()\fP, it is possible
to continue the match by providing additional subject data and calling
\fBpcre_dfa_exec()\fP again with the PCRE_DFA_RESTART option and the same
working space (where details of the previous partial match are stored). Here is
an example using \fBpcretest\fP, where the \eR escape sequence sets the
PCRE_DFA_RESTART option and the \eD escape sequence requests the use of
\fBpcre_dfa_exec()\fP:
.sp
    re> /^\ed?\ed(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\ed\ed$/
  data> 23ja\eP\eD
  Partial match: 23ja
  data> n05\eR\eD
   0: n05
.sp
The first call has "23ja" as the subject, and requests partial matching; the
second call has "n05" as the subject for the continued (restarted) match.
Notice that when the match is complete, only the last part is shown; PCRE does
not retain the previously partially-matched string. It is up to the calling
program to do that if it needs to.
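.P
The restart mechanism can be pictured as a matcher whose surviving
possibilities are saved between calls. The Python sketch below is a model
under stated assumptions, not PCRE code: feed() and its workspace argument are
hypothetical names standing in for a restarted call and the saved working
space, and the date pattern is again hard-coded as a set of templates in which
"#" stands for a digit.
.sp
.nf
```python
# Illustrative model of restarting a match with saved state.
# Not PCRE code; all names are hypothetical.
MONTHS = ("jan", "feb", "mar", "apr", "may", "jun",
          "jul", "aug", "sep", "oct", "nov", "dec")
DIGITS = "0123456789"
TEMPLATES = tuple("#" * n + m + "##" for n in (1, 2) for m in MONTHS)


def fits(ch, slot):
    """Return True if character ch is acceptable at template position slot."""
    return ch in DIGITS if slot == "#" else ch == slot


def feed(segment, workspace=None):
    """Match one segment and return (status, workspace).

    workspace is a list of (template, position) pairs that are still alive.
    Pass None to start a new match, or the workspace returned by an earlier
    call to continue after a partial match.
    """
    if workspace is None:
        workspace = [(t, 0) for t in TEMPLATES]
    for ch in segment:
        # Keep only the possibilities that accept the next character.
        workspace = [(t, pos + 1) for t, pos in workspace
                     if pos < len(t) and fits(ch, t[pos])]
        if not workspace:
            return "no match", workspace
    if any(pos == len(t) for t, pos in workspace):
        return "match", workspace
    return "partial", workspace


status, saved = feed("23ja")            # first segment: partial
final, _ = feed("n05", saved)           # restarted with the saved state
typed = "23ja" + "n05"                  # the caller keeps the full text itself
```
.fi
.sp
In this model the second call sees only "n05", so anything that needs the
complete text, such as the variable typed here, has to be kept by the caller.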
.P
This facility can be used to pass very long subject strings to
\fBpcre_dfa_exec()\fP. However, some care is needed for certain types of
pattern.
.P
1. If the pattern contains tests for the beginning or end of a line, you need
to pass the PCRE_NOTBOL or PCRE_NOTEOL options, as appropriate, when the
subject string for any call does not contain the beginning or end of a line.
.P
2. If the pattern contains backward assertions (including \eb or \eB), you need
to arrange for some overlap in the subject strings to allow for this. For
example, you could pass the subject in chunks that were 500 bytes long, but in
a buffer of 700 bytes, with the starting offset set to 200 and the previous 200
bytes at the start of the buffer.
.P
3. Matching a subject string that is split into multiple segments does not
always produce exactly the same result as matching over one single long string.
The difference arises when there are multiple matching possibilities, because a
partial match result is given only when there are no completed matches in a
call to \fBpcre_dfa_exec()\fP. This means that as soon as the shortest match has
been found, continuation to a new subject segment is no longer possible.
Consider this \fBpcretest\fP example:
.sp
    re> /dog(sbody)?/
  data> do\eP\eD
  Partial match: do
  data> gsb\eR\eP\eD
   0: g
  data> dogsbody\eD
   0: dogsbody
   1: dog
.sp
The pattern matches the words "dog" or "dogsbody". When the subject is
presented in several parts ("do" and "gsb" being the first two) the match stops
when "dog" has been found, and it is not possible to continue. On the other
hand, if "dogsbody" is presented as a single string, both matches are found.
.P
Because of this phenomenon, it does not usually make sense to end a pattern
that is going to be matched in this way with a variable repeat.
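.P
The overlap arrangement described in point 2 above can be sketched as
follows. This Python fragment only prepares the buffers and performs no
matching; overlapped_buffers() is a hypothetical helper, and the sizes of 500
and 200 bytes are simply the figures from the example. Presumably the offset
it yields would be supplied as the starting offset of each call.
.sp
.nf
```python
# Illustrative buffer preparation for chunked subjects with look-behind
# overlap. Not PCRE code; the function name is hypothetical.
def overlapped_buffers(data, chunk=500, overlap=200):
    """Yield (buffer, start_offset) pairs that cover data chunk by chunk.

    Each buffer holds up to `overlap` bytes that precede the chunk, followed
    by the chunk itself. start_offset is the position in the buffer at which
    the new chunk begins, so matching would start there while backward
    assertions can still inspect the earlier bytes.
    """
    for pos in range(0, len(data), chunk):
        lookback = min(overlap, pos)      # the first chunk has no history
        yield data[pos - lookback:pos + chunk], lookback


subject = bytes(range(256)) * 5           # 1280 bytes of sample data
buffers = list(overlapped_buffers(subject))
```
.fi
.sp
With the sample data this gives three buffers. The second is 700 bytes long
with a start offset of 200, which corresponds to the layout described in
point 2.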
.
.
.P
.in 0
Last updated: 28 February 2005
.br
Copyright (c) 1997-2005 University of Cambridge.
