/[pcre]/code/trunk/doc/html/pcrematching.html
ViewVC logotype

Contents of /code/trunk/doc/html/pcrematching.html

Parent Directory Parent Directory | Revision Log Revision Log


Revision 93 - (hide annotations) (download) (as text)
Sat Feb 24 21:41:42 2007 UTC (6 years, 2 months ago) by nigel
File MIME type: text/html
File size: 9007 byte(s)
Load pcre-7.0 into code/trunk.

1 nigel 77 <html>
2     <head>
3     <title>pcrematching specification</title>
4     </head>
5     <body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
6     <h1>pcrematching man page</h1>
7     <p>
8     Return to the <a href="index.html">PCRE index page</a>.
9     </p>
10     <p>
11     This page is part of the PCRE HTML documentation. It was generated automatically
12     from the original man page. If there is any nonsense in it, please consult the
13     man page, in case the conversion went wrong.
14     <br>
15     <ul>
16     <li><a name="TOC1" href="#SEC1">PCRE MATCHING ALGORITHMS</a>
17     <li><a name="TOC2" href="#SEC2">REGULAR EXPRESSIONS AS TREES</a>
18     <li><a name="TOC3" href="#SEC3">THE STANDARD MATCHING ALGORITHM</a>
19 nigel 93 <li><a name="TOC4" href="#SEC4">THE ALTERNATIVE MATCHING ALGORITHM</a>
20     <li><a name="TOC5" href="#SEC5">ADVANTAGES OF THE ALTERNATIVE ALGORITHM</a>
21     <li><a name="TOC6" href="#SEC6">DISADVANTAGES OF THE ALTERNATIVE ALGORITHM</a>
22 nigel 77 </ul>
23     <br><a name="SEC1" href="#TOC1">PCRE MATCHING ALGORITHMS</a><br>
24     <P>
25     This document describes the two different algorithms that are available in PCRE
26     for matching a compiled regular expression against a given subject string. The
27     "standard" algorithm is the one provided by the <b>pcre_exec()</b> function.
28     This works in the same was as Perl's matching function, and provides a
29     Perl-compatible matching operation.
30     </P>
31     <P>
32     An alternative algorithm is provided by the <b>pcre_dfa_exec()</b> function;
33     this operates in a different way, and is not Perl-compatible. It has advantages
34     and disadvantages compared with the standard algorithm, and these are described
35     below.
36     </P>
37     <P>
38     When there is only one possible way in which a given subject string can match a
39     pattern, the two algorithms give the same answer. A difference arises, however,
40     when there are multiple possibilities. For example, if the pattern
41     <pre>
42     ^&#60;.*&#62;
43     </pre>
44     is matched against the string
45     <pre>
46     &#60;something&#62; &#60;something else&#62; &#60;something further&#62;
47     </pre>
48     there are three possible answers. The standard algorithm finds only one of
49 nigel 93 them, whereas the alternative algorithm finds all three.
50 nigel 77 </P>
51     <br><a name="SEC2" href="#TOC1">REGULAR EXPRESSIONS AS TREES</a><br>
52     <P>
53     The set of strings that are matched by a regular expression can be represented
54     as a tree structure. An unlimited repetition in the pattern makes the tree of
55     infinite size, but it is still a tree. Matching the pattern to a given subject
56     string (from a given starting point) can be thought of as a search of the tree.
57 nigel 91 There are two ways to search a tree: depth-first and breadth-first, and these
58     correspond to the two matching algorithms provided by PCRE.
59 nigel 77 </P>
60     <br><a name="SEC3" href="#TOC1">THE STANDARD MATCHING ALGORITHM</a><br>
61     <P>
62     In the terminology of Jeffrey Friedl's book \fIMastering Regular
63     Expressions\fP, the standard algorithm is an "NFA algorithm". It conducts a
64     depth-first search of the pattern tree. That is, it proceeds along a single
65     path through the tree, checking that the subject matches what is required. When
66     there is a mismatch, the algorithm tries any alternatives at the current point,
67     and if they all fail, it backs up to the previous branch point in the tree, and
68     tries the next alternative branch at that level. This often involves backing up
69     (moving to the left) in the subject string as well. The order in which
70     repetition branches are tried is controlled by the greedy or ungreedy nature of
71     the quantifier.
72     </P>
73     <P>
74     If a leaf node is reached, a matching string has been found, and at that point
75     the algorithm stops. Thus, if there is more than one possible match, this
76     algorithm returns the first one that it finds. Whether this is the shortest,
77     the longest, or some intermediate length depends on the way the greedy and
78     ungreedy repetition quantifiers are specified in the pattern.
79     </P>
80     <P>
81     Because it ends up with a single path through the tree, it is relatively
82     straightforward for this algorithm to keep track of the substrings that are
83     matched by portions of the pattern in parentheses. This provides support for
84     capturing parentheses and back references.
85     </P>
86 nigel 93 <br><a name="SEC4" href="#TOC1">THE ALTERNATIVE MATCHING ALGORITHM</a><br>
87 nigel 77 <P>
88 nigel 93 This algorithm conducts a breadth-first search of the tree. Starting from the
89     first matching point in the subject, it scans the subject string from left to
90     right, once, character by character, and as it does this, it remembers all the
91     paths through the tree that represent valid matches. In Friedl's terminology,
92     this is a kind of "DFA algorithm", though it is not implemented as a
93     traditional finite state machine (it keeps multiple states active
94     simultaneously).
95 nigel 77 </P>
96     <P>
97     The scan continues until either the end of the subject is reached, or there are
98     no more unterminated paths. At this point, terminated paths represent the
99     different matching possibilities (if there are none, the match has failed).
100     Thus, if there is more than one possible match, this algorithm finds all of
101     them, and in particular, it finds the longest. In PCRE, there is an option to
102     stop the algorithm after the first match (which is necessarily the shortest)
103     has been found.
104     </P>
105     <P>
106     Note that all the matches that are found start at the same point in the
107     subject. If the pattern
108     <pre>
109     cat(er(pillar)?)
110     </pre>
111     is matched against the string "the caterpillar catchment", the result will be
112     the three strings "cat", "cater", and "caterpillar" that start at the fourth
113     character of the subject. The algorithm does not automatically move on to find
114     matches that start at later positions.
115     </P>
116     <P>
117     There are a number of features of PCRE regular expressions that are not
118 nigel 93 supported by the alternative matching algorithm. They are as follows:
119 nigel 77 </P>
120     <P>
121     1. Because the algorithm finds all possible matches, the greedy or ungreedy
122     nature of repetition quantifiers is not relevant. Greedy and ungreedy
123 nigel 93 quantifiers are treated in exactly the same way. However, possessive
124     quantifiers can make a difference when what follows could also match what is
125     quantified, for example in a pattern like this:
126     <pre>
127     ^a++\w!
128     </pre>
129     This pattern matches "aaab!" but not "aaa!", which would be matched by a
130     non-possessive quantifier. Similarly, if an atomic group is present, it is
131     matched as if it were a standalone pattern at the current point, and the
132     longest match is then "locked in" for the rest of the overall pattern.
133 nigel 77 </P>
134     <P>
135     2. When dealing with multiple paths through the tree simultaneously, it is not
136     straightforward to keep track of captured substrings for the different matching
137     possibilities, and PCRE's implementation of this algorithm does not attempt to
138     do this. This means that no captured substrings are available.
139     </P>
140     <P>
141     3. Because no substrings are captured, back references within the pattern are
142     not supported, and cause errors if encountered.
143     </P>
144     <P>
145     4. For the same reason, conditional expressions that use a backreference as the
146 nigel 93 condition or test for a specific group recursion are not supported.
147 nigel 77 </P>
148     <P>
149     5. Callouts are supported, but the value of the <i>capture_top</i> field is
150     always 1, and the value of the <i>capture_last</i> field is always -1.
151     </P>
152     <P>
153     6.
154     The \C escape sequence, which (in the standard algorithm) matches a single
155 nigel 93 byte, even in UTF-8 mode, is not supported because the alternative algorithm
156     moves through the subject string one character at a time, for all active paths
157 nigel 77 through the tree.
158     </P>
159 nigel 93 <br><a name="SEC5" href="#TOC1">ADVANTAGES OF THE ALTERNATIVE ALGORITHM</a><br>
160 nigel 77 <P>
161 nigel 93 Using the alternative matching algorithm provides the following advantages:
162 nigel 77 </P>
163     <P>
164     1. All possible matches (at a single point in the subject) are automatically
165     found, and in particular, the longest match is found. To find more than one
166     match using the standard algorithm, you have to do kludgy things with
167     callouts.
168     </P>
169     <P>
170     2. There is much better support for partial matching. The restrictions on the
171     content of the pattern that apply when using the standard algorithm for partial
172 nigel 93 matching do not apply to the alternative algorithm. For non-anchored patterns,
173     the starting position of a partial match is available.
174 nigel 77 </P>
175     <P>
176 nigel 93 3. Because the alternative algorithm scans the subject string just once, and
177     never needs to backtrack, it is possible to pass very long subject strings to
178     the matching function in several pieces, checking for partial matching each
179     time.
180 nigel 77 </P>
181 nigel 93 <br><a name="SEC6" href="#TOC1">DISADVANTAGES OF THE ALTERNATIVE ALGORITHM</a><br>
182 nigel 77 <P>
183 nigel 93 The alternative algorithm suffers from a number of disadvantages:
184 nigel 77 </P>
185     <P>
186     1. It is substantially slower than the standard algorithm. This is partly
187     because it has to search for all possible matches, but is also because it is
188     less susceptible to optimization.
189     </P>
190     <P>
191     2. Capturing parentheses and back references are not supported.
192     </P>
193     <P>
194 nigel 93 3. Although atomic groups are supported, their use does not provide the
195     performance advantage that it does for the standard algorithm.
196 nigel 77 </P>
197     <P>
198 nigel 93 Last updated: 24 November 2006
199 nigel 77 <br>
200 nigel 91 Copyright &copy; 1997-2006 University of Cambridge.
201 nigel 77 <p>
202     Return to the <a href="index.html">PCRE index page</a>.
203     </p>

webmaster@exim.org
ViewVC Help
Powered by ViewVC 1.1.12