.TH PCRECPP 3
.SH NAME
PCRE - Perl-compatible regular expressions.
.SH "SYNOPSIS OF C++ WRAPPER"
.rs
.sp
.B #include <pcrecpp.h>
.
.SH DESCRIPTION
.rs
.sp
The C++ wrapper for PCRE was provided by Google Inc. Some additional
functionality was added by Giuseppe Maxia. This brief man page was constructed
from the notes in the \fIpcrecpp.h\fP file, which should be consulted for
further details.
.
.
.SH "MATCHING INTERFACE"
.rs
.sp
The "FullMatch" operation checks that supplied text matches a supplied pattern
exactly. If pointer arguments are supplied, it copies matched sub-strings that
match sub-patterns into them.
.sp
  Example: successful match
     pcrecpp::RE re("h.*o");
     re.FullMatch("hello");
.sp
  Example: unsuccessful match (requires full match):
     pcrecpp::RE re("e");
     !re.FullMatch("hello");
.sp
  Example: creating a temporary RE object:
     pcrecpp::RE("h.*o").FullMatch("hello");
.sp
You can pass in a "const char*" or a "string" for "text". The examples below
tend to use a const char*. As the examples above show, you can either store
the RE object explicitly in a variable or use a temporary RE object. The
examples below use one mode or the other arbitrarily. Either could correctly be
used for any of these examples.
.P
You must supply extra pointer arguments to extract matched subpieces.
.sp
  Example: extracts "ruby" into "s" and 1234 into "i"
     int i;
     string s;
     pcrecpp::RE re("(\e\ew+):(\e\ed+)");
     re.FullMatch("ruby:1234", &s, &i);
.sp
  Example: does not try to extract any extra sub-patterns
     re.FullMatch("ruby:1234", &s);
.sp
  Example: does not try to extract into NULL
     re.FullMatch("ruby:1234", NULL, &i);
.sp
  Example: integer overflow causes failure
     !re.FullMatch("ruby:1234567891234", NULL, &i);
.sp
  Example: fails because there aren't enough sub-patterns:
     !pcrecpp::RE("\e\ew+:\e\ed+").FullMatch("ruby:1234", &s);
.sp
  Example: fails because string cannot be stored in integer
     !pcrecpp::RE("(.*)").FullMatch("ruby", &i);
.sp
The provided pointer arguments can be pointers to any scalar numeric
type, or one of:
.sp
   string        (matched piece is copied to string)
   StringPiece   (StringPiece is mutated to point to matched piece)
   T             (where "bool T::ParseFrom(const char*, int)" exists)
   NULL          (the corresponding matched sub-pattern is not copied)
.sp
The function returns true iff all of the following conditions are satisfied:
.sp
  a. "text" matches "pattern" exactly;
.sp
  b. The number of matched sub-patterns is >= number of supplied
     pointers;
.sp
  c. The "i"th argument has a suitable type for holding the
     string captured as the "i"th sub-pattern. If you pass in
     void * NULL for the "i"th argument, or a non-void * NULL
     of the correct type, or pass fewer arguments than the
     number of sub-patterns, the "i"th captured sub-pattern is
     ignored.
.sp
CAVEAT: An optional sub-pattern that does not exist in the matched
string is assigned the empty string. Therefore, the following will
return false (because the empty string is not a valid number):
.sp
   int number;
   pcrecpp::RE("[a-z]+(\e\ed+)?").FullMatch("abc", &number);
.sp
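The numeric conversion rules described above (overflow fails, and a non-numeric
or empty capture fails) can be pictured with a small stand-alone sketch. It does
not use pcrecpp at all: \fBParseIntStrict\fP is a hypothetical helper written
for this page, and the wrapper's real conversion code may differ in detail.

```cpp
#include <cassert>
#include <cctype>
#include <cerrno>
#include <climits>
#include <cstdlib>
#include <string>

// Hypothetical helper, not part of pcrecpp: converts a captured piece of
// text to an int and reports failure instead of guessing a value.
bool ParseIntStrict(const std::string& text, int* out) {
  if (text.empty()) return false;  // an unmatched optional group is ""
  // strtol would silently skip leading whitespace; treat it as an error.
  if (std::isspace(static_cast<unsigned char>(text[0]))) return false;
  errno = 0;
  char* end = nullptr;
  const long value = std::strtol(text.c_str(), &end, 10);
  if (end != text.c_str() + text.size()) return false;  // trailing junk
  if (errno == ERANGE) return false;                     // overflows long
  if (value < INT_MIN || value > INT_MAX) return false;  // overflows int
  if (out != nullptr) *out = static_cast<int>(value);    // NULL: no copy
  return true;
}

void ParseIntExample() {
  int i = 0;
  assert(ParseIntStrict("1234", &i) && i == 1234);
  assert(!ParseIntStrict("1234567891234", &i));  // integer overflow
  assert(!ParseIntStrict("ruby", &i));           // not a number
  assert(!ParseIntStrict("", &i));               // the CAVEAT case
}
```

Under these assumptions a failed conversion makes the whole match report
failure, which is why the CAVEAT example returns false.
.P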
The matching interface supports at most 16 arguments per call.
If you need more, consider using the more general interface
\fBpcrecpp::RE::DoMatch\fP. See \fBpcrecpp.h\fP for the signature for
\fBDoMatch\fP.
.P
NOTE: Do not use \fBno_arg\fP, which is used internally to mark the end of a
list of optional arguments, as a placeholder for missing arguments, as this can
lead to segfaults.
.
.
.SH "QUOTING METACHARACTERS"
.rs
.sp
You can use the "QuoteMeta" operation to insert backslashes before all
potentially meaningful characters in a string. The returned string, used as a
regular expression, will exactly match the original string.
.sp
  Example:
     string quoted = RE::QuoteMeta(unquoted);
.sp
Note that it is legal to escape a character even if it has no special meaning
in a regular expression, so this function does that. (This also makes it
identical to the Perl function of the same name; see "perldoc -f quotemeta".)
For example, "1.5-2.0?" becomes "1\e.5\e-2\e.0\e?".
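.P
As a rough illustration of the quoting rule, the stand-alone sketch below
escapes every byte other than ASCII letters, digits, and underscore.
\fBQuoteMetaSketch\fP is a hypothetical name, and two details are assumptions
of this sketch rather than statements about the library: bytes with the high
bit set are left alone so that UTF-8 sequences stay intact, and a NUL byte is
written as a backslash followed by the digit 0. Real code should call
RE::QuoteMeta itself.

```cpp
#include <cassert>
#include <string>

// Hypothetical re-implementation of the quoting rule, for illustration only.
std::string QuoteMetaSketch(const std::string& unquoted) {
  std::string quoted;
  quoted.reserve(unquoted.size() * 2);
  for (char ch : unquoted) {
    const unsigned char c = static_cast<unsigned char>(ch);
    const bool is_word = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') ||
                         (c >= '0' && c <= '9') || c == '_';
    if (c == 0) {
      quoted += "\\0";  // assumption: keep the result free of NUL bytes
      continue;
    }
    // Assumption: high-bit bytes (UTF-8 lead/continuation) are not escaped.
    if (!is_word && c < 0x80) quoted += '\\';
    quoted += ch;
  }
  return quoted;
}

void QuoteMetaExample() {
  assert(QuoteMetaSketch("1.5-2.0?") == "1\\.5\\-2\\.0\\?");
  assert(QuoteMetaSketch("abc_123") == "abc_123");
}
```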
.
.SH "PARTIAL MATCHES"
.rs
.sp
You can use the "PartialMatch" operation when you want the pattern
to match any substring of the text.
.sp
  Example: simple search for a string:
     pcrecpp::RE("ell").PartialMatch("hello");
.sp
  Example: find first number in a string:
     int number;
     pcrecpp::RE re("(\e\ed+)");
     re.PartialMatch("x*100 + 20", &number);
     assert(number == 100);
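.P
For readers more familiar with the C++ standard library, the difference between
"FullMatch" and "PartialMatch" is roughly the difference between
std::regex_match and std::regex_search. The sketch below uses <regex> rather
than pcrecpp, so the pattern syntax is ECMAScript and not PCRE, and
\fBFirstNumber\fP is a hypothetical helper introduced only for this example.

```cpp
#include <cassert>
#include <regex>
#include <stdexcept>
#include <string>

// Hypothetical helper: stores the first run of digits found anywhere in
// text into *number. Returns false if there is none or it overflows int.
bool FirstNumber(const std::string& text, int* number) {
  static const std::regex re("(\\d+)");
  std::smatch m;
  if (!std::regex_search(text, m, re)) return false;
  try {
    *number = std::stoi(m[1].str());
  } catch (const std::out_of_range&) {
    return false;
  }
  return true;
}

void PartialMatchExample() {
  // Whole-string match, comparable to FullMatch.
  assert(std::regex_match("hello", std::regex("h.*o")));
  assert(!std::regex_match("hello", std::regex("e")));
  // Substring match, comparable to PartialMatch.
  assert(std::regex_search("hello", std::regex("ell")));
  int number = 0;
  assert(FirstNumber("x*100 + 20", &number) && number == 100);
}
```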
.
.
.SH "UTF-8 AND THE MATCHING INTERFACE"
.rs
.sp
By default, pattern and text are plain text, one byte per character. The UTF8
flag, passed to the constructor, causes both pattern and string to be treated
as UTF-8 text, still a byte stream but potentially multiple bytes per
character. In practice, the text is likelier to be UTF-8 than the pattern, but
the match returned may depend on the UTF8 flag, so always use it when matching
UTF-8 text. For example, "." will match one byte normally but with UTF8 set may
match up to three bytes of a multi-byte character.
.sp
  Example:
     pcrecpp::RE_Options options;
     options.set_utf8();
     pcrecpp::RE re(utf8_pattern, options);
     re.FullMatch(utf8_string);
.sp
  Example: using the convenience function UTF8():
     pcrecpp::RE re(utf8_pattern, pcrecpp::UTF8());
     re.FullMatch(utf8_string);
.sp
NOTE: The UTF8 flag is ignored if PCRE was not configured with the
--enable-utf8 flag.
.
.
.SH "PASSING MODIFIERS TO THE REGULAR EXPRESSION ENGINE"
.rs
.sp
PCRE defines some modifiers to change the behavior of the regular expression
engine. The C++ wrapper defines an auxiliary class, RE_Options, as a vehicle to
pass such modifiers to a RE class. Currently, the following modifiers are
supported:
.sp
   modifier              description                 Perl equivalent
.sp
   PCRE_CASELESS         case insensitive match      /i
   PCRE_MULTILINE        multiple lines match        /m
   PCRE_DOTALL           dot matches newlines        /s
   PCRE_DOLLAR_ENDONLY   $ matches only at end       N/A
   PCRE_EXTRA            strict escape parsing       N/A
   PCRE_EXTENDED         ignore whitespace           /x
   PCRE_UTF8             handles UTF8 chars          built-in
   PCRE_UNGREEDY         reverses * and *?           N/A
   PCRE_NO_AUTO_CAPTURE  disables capturing parens   N/A (*)
.sp
(*) Both Perl and PCRE allow non-capturing parentheses by means of the
"?:" modifier within the pattern itself. For example, (?:ab|cd) does not
capture, while (ab|cd) does.
.P
For a full account of how each modifier works, please check the
PCRE API reference page.
.P
For each modifier, there are two member functions whose names are formed
from the modifier in lowercase, without the "PCRE_" prefix. For
instance, PCRE_CASELESS is handled by
.sp
  bool caseless()
.sp
which returns true if the modifier is set, and
.sp
  RE_Options & set_caseless(bool)
.sp
which sets or unsets the modifier. Moreover, PCRE_EXTRA_MATCH_LIMIT can be
accessed through the \fBset_match_limit()\fR and \fBmatch_limit()\fR member
functions. Setting \fImatch_limit\fR to a non-zero value will limit the
execution of PCRE to keep it from doing bad things like blowing the stack or
taking an eternity to return a result. A value of 5000 is good enough to stop
stack blowup in a 2MB thread stack. Setting \fImatch_limit\fR to zero disables
match limiting. Alternatively, you can call \fBset_match_limit_recursion()\fP,
which uses PCRE_EXTRA_MATCH_LIMIT_RECURSION to limit how much PCRE
recurses. \fBmatch_limit()\fP limits the number of matches PCRE does;
\fBmatch_limit_recursion()\fP limits the depth of internal recursion, and
therefore the amount of stack that is used.
.P
Normally, to pass one or more modifiers to a RE class, you declare
a \fIRE_Options\fR object, set the appropriate options, and pass this
object to a RE constructor. Example:
.sp
   RE_Options opt;
   opt.set_caseless(true);
   if (RE("HELLO", opt).PartialMatch("hello world")) ...
.sp
RE_Options has two constructors. The default constructor takes no arguments and
creates a set of flags that are off by default. The second takes an
\fIoption_flags\fR parameter, which facilitates transfer of legacy code from C
programs. This lets you do
.sp
   RE(pattern,
     RE_Options(PCRE_CASELESS|PCRE_MULTILINE)).PartialMatch(str);
.sp
However, new code is better off doing
.sp
   RE(pattern,
     RE_Options().set_caseless(true).set_multiline(true))
       .PartialMatch(str);
.sp
If you are going to pass one of the most used modifiers, there are some
convenience functions that return a RE_Options class with the
appropriate modifier already set: \fBCASELESS()\fR, \fBUTF8()\fR,
\fBMULTILINE()\fR, \fBDOTALL()\fR, and \fBEXTENDED()\fR.
.P
If you need to set several options at once, and you don't want to go through
the pains of declaring a RE_Options object and setting several options, there
is a parallel method that gives you this ability on the fly. You can chain
several \fBset_xxxxx()\fR member functions, since each of them returns a
reference to its class object. For example, to pass PCRE_CASELESS,
PCRE_EXTENDED, and PCRE_MULTILINE to a RE with one statement, you may write:
.sp
   RE(" ^ xyz \e\es+ .* blah$",
     RE_Options()
       .set_caseless(true)
       .set_extended(true)
       .set_multiline(true)).PartialMatch(sometext);
.sp
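The chaining shown above works because every setter returns a reference to the
object it was called on. The following minimal sketch shows that design in
isolation. \fBOptionsSketch\fP and the flag values are hypothetical and exist
only for this example; real code uses RE_Options and the PCRE_* constants from
pcre.h.

```cpp
#include <cassert>

// Hypothetical flag values for this sketch, not the real PCRE_* constants.
enum : int { kCaseless = 1 << 0, kMultiline = 1 << 1, kExtended = 1 << 2 };

class OptionsSketch {
 public:
  OptionsSketch() : flags_(0) {}  // all flags off by default
  explicit OptionsSketch(int option_flags) : flags_(option_flags) {}

  // Getters: true if the modifier is set.
  bool caseless() const { return (flags_ & kCaseless) != 0; }
  bool multiline() const { return (flags_ & kMultiline) != 0; }
  bool extended() const { return (flags_ & kExtended) != 0; }

  // Setters return *this so that calls can be chained.
  OptionsSketch& set_caseless(bool on) { return Set(kCaseless, on); }
  OptionsSketch& set_multiline(bool on) { return Set(kMultiline, on); }
  OptionsSketch& set_extended(bool on) { return Set(kExtended, on); }

  int all_options() const { return flags_; }

 private:
  OptionsSketch& Set(int flag, bool on) {
    if (on) {
      flags_ |= flag;
    } else {
      flags_ &= ~flag;
    }
    return *this;
  }
  int flags_;
};

void OptionsExample() {
  OptionsSketch opt =
      OptionsSketch().set_caseless(true).set_extended(true).set_multiline(true);
  assert(opt.caseless() && opt.extended() && opt.multiline());
  opt.set_extended(false);
  assert(!opt.extended());
  assert(opt.all_options() == (kCaseless | kMultiline));
}
```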
.
.
.SH "SCANNING TEXT INCREMENTALLY"
.rs
.sp
The "Consume" operation may be useful if you want to repeatedly
match regular expressions at the front of a string and skip over
them as they match. This requires use of the "StringPiece" type,
which represents a sub-range of a real string. Like RE, StringPiece
is defined in the pcrecpp namespace.
.sp
  Example: read lines of the form "var = value" from a string.
     string contents = ...;                 // Fill string somehow
     pcrecpp::StringPiece input(contents);  // Wrap in a StringPiece

     string var;
     int value;
     pcrecpp::RE re("(\e\ew+) = (\e\ed+)\en");
     while (re.Consume(&input, &var, &value)) {
       ...;
     }
.sp
Each successful call to "Consume" will set "var/value", and also
advance "input" so it points past the matched text.
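.P
The same "match at the front, then advance" loop can be sketched with the
standard library alone. This is an analogy, not the pcrecpp implementation: it
uses std::regex with the match_continuous flag to anchor each attempt at the
current position, an iterator plays the role of the StringPiece, and
\fBConsumeAssignments\fP is a hypothetical function name.

```cpp
#include <cassert>
#include <regex>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: reads consecutive "var = value\n" lines from the
// front of contents and stops at the first position that does not match.
std::vector<std::pair<std::string, int>> ConsumeAssignments(
    const std::string& contents) {
  static const std::regex re("(\\w+) = (\\d+)\n");
  std::vector<std::pair<std::string, int>> result;
  std::string::const_iterator input = contents.begin();
  std::smatch m;
  // match_continuous: the match must start exactly at "input".
  while (std::regex_search(input, contents.end(), m, re,
                           std::regex_constants::match_continuous)) {
    int value = 0;
    try {
      value = std::stoi(m[2].str());
    } catch (const std::out_of_range&) {
      break;  // treat an overflowing number as a failed match
    }
    result.emplace_back(m[1].str(), value);
    input = m[0].second;  // advance past the matched text
  }
  return result;
}

void ConsumeExample() {
  const auto vars = ConsumeAssignments("alpha = 1\nbeta = 22\ntrailing text");
  assert(vars.size() == 2);
  assert(vars[0].first == "alpha" && vars[0].second == 1);
  assert(vars[1].first == "beta" && vars[1].second == 22);
}
```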
.P
The "FindAndConsume" operation is similar to "Consume" but does not
anchor your match at the beginning of the string. For example, you
could extract all words from a string by repeatedly calling
.sp
  pcrecpp::RE("(\e\ew+)").FindAndConsume(&input, &word)
.
.
.SH "PARSING HEX/OCTAL/C-RADIX NUMBERS"
.rs
.sp
By default, if you pass a pointer to a numeric value, the
corresponding text is interpreted as a base-10 number. You can
instead wrap the pointer with a call to one of the operators Hex(),
Octal(), or CRadix() to interpret the text in another base. The
CRadix operator interprets C-style "0" (base-8) and "0x" (base-16)
prefixes, but defaults to base-10.
.sp
  Example:
    int a, b, c, d;
    pcrecpp::RE re("(.*) (.*) (.*) (.*)");
    re.FullMatch("100 40 0100 0x40",
                 pcrecpp::Octal(&a), pcrecpp::Hex(&b),
                 pcrecpp::CRadix(&c), pcrecpp::CRadix(&d));
.sp
will leave 64 in a, b, c, and d.
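.P
The arithmetic behind that example can be checked with std::strtol, whose base
argument behaves in a comparable way: 8 and 16 select a fixed base, and 0
selects the base from a C-style prefix. The sketch below is a stand-alone
illustration, \fBParseWithRadix\fP is a hypothetical helper, and mapping Octal,
Hex, and CRadix to bases 8, 16, and 0 is an assumption made for this example.

```cpp
#include <cassert>
#include <cerrno>
#include <cstdlib>
#include <string>

// Hypothetical helper: base 8 stands in for Octal(), 16 for Hex(), and
// 0 for CRadix() ("0" prefix is octal, "0x" is hex, otherwise decimal).
bool ParseWithRadix(const std::string& text, int base, long* out) {
  if (text.empty()) return false;
  errno = 0;
  char* end = nullptr;
  const long value = std::strtol(text.c_str(), &end, base);
  if (errno == ERANGE) return false;                     // out of range
  if (end != text.c_str() + text.size()) return false;  // invalid digits
  *out = value;
  return true;
}

void RadixExample() {
  long a = 0, b = 0, c = 0, d = 0;
  assert(ParseWithRadix("100", 8, &a) && a == 64);    // octal 100
  assert(ParseWithRadix("40", 16, &b) && b == 64);    // hex 40
  assert(ParseWithRadix("0100", 0, &c) && c == 64);   // "0" prefix: octal
  assert(ParseWithRadix("0x40", 0, &d) && d == 64);   // "0x" prefix: hex
}
```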
.
.
.SH "REPLACING PARTS OF STRINGS"
.rs
.sp
You can replace the first match of "pattern" in "str" with "rewrite".
Within "rewrite", backslash-escaped digits (\e1 to \e9) can be
used to insert text matching the corresponding parenthesized group
from the pattern. \e0 in "rewrite" refers to the entire matching
text. For example:
.sp
  string s = "yabba dabba doo";
  pcrecpp::RE("b+").Replace("d", &s);
.sp
will leave "s" containing "yada dabba doo". The result is true if the pattern
matches and a replacement occurs, false otherwise.
.P
\fBGlobalReplace\fP is like \fBReplace\fP except that it replaces all
occurrences of the pattern in the string with the rewrite. Replacements are
not subject to re-matching. For example:
.sp
  string s = "yabba dabba doo";
  pcrecpp::RE("b+").GlobalReplace("d", &s);
.sp
will leave "s" containing "yada dada doo". It returns the number of
replacements made.
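.P
The first-match versus all-matches distinction has a close analogue in the
standard library, which may help when comparing results. The sketch below uses
std::regex_replace, not pcrecpp; \fBReplaceFirst\fP and \fBReplaceAll\fP are
hypothetical wrappers, they return the new string instead of modifying their
argument, and std::regex format strings refer to groups with a dollar sign
rather than a backslash.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical wrapper: replaces only the first match of pattern.
std::string ReplaceFirst(const std::string& s, const std::string& pattern,
                         const std::string& rewrite) {
  return std::regex_replace(s, std::regex(pattern), rewrite,
                            std::regex_constants::format_first_only);
}

// Hypothetical wrapper: replaces every non-overlapping match of pattern.
std::string ReplaceAll(const std::string& s, const std::string& pattern,
                       const std::string& rewrite) {
  return std::regex_replace(s, std::regex(pattern), rewrite);
}

void ReplaceExample() {
  assert(ReplaceFirst("yabba dabba doo", "b+", "d") == "yada dabba doo");
  assert(ReplaceAll("yabba dabba doo", "b+", "d") == "yada dada doo");
}
```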
.P
\fBExtract\fP is like \fBReplace\fP, except that if the pattern matches,
"rewrite" is copied into "out" (an additional argument) with substitutions.
The non-matching portions of "text" are ignored. Returns true iff a match
occurred and the extraction happened successfully; if no match occurs, the
string is left unaffected.
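.P
The substitution step that "Replace" and "Extract" apply to "rewrite" can be
sketched as a small stand-alone function. This is an illustration under stated
assumptions, not the library's code: \fBExpandRewrite\fP is a hypothetical
name, the caller supplies the captured groups with index 0 holding the entire
match, a doubled backslash is taken to mean one literal backslash, and any
other use of a backslash is reported as an error.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: expands backslash-digit references in rewrite.
// groups[0] is the entire match; groups[n] is the nth parenthesized group.
bool ExpandRewrite(const std::string& rewrite,
                   const std::vector<std::string>& groups, std::string* out) {
  std::string result;
  for (std::size_t i = 0; i < rewrite.size(); ++i) {
    const char c = rewrite[i];
    if (c != '\\') {
      result += c;
      continue;
    }
    if (++i == rewrite.size()) return false;  // trailing backslash
    const char next = rewrite[i];
    if (next >= '0' && next <= '9') {
      const std::size_t n = static_cast<std::size_t>(next - '0');
      if (n >= groups.size()) return false;  // no such group
      result += groups[n];
    } else if (next == '\\') {
      result += '\\';  // assumption: doubled backslash is a literal one
    } else {
      return false;  // assumption: other escapes are rejected
    }
  }
  *out = result;
  return true;
}

void ExpandRewriteExample() {
  const std::vector<std::string> groups = {"ruby:1234", "ruby", "1234"};
  std::string out;
  assert(ExpandRewrite("\\2=\\1 (\\0)", groups, &out));
  assert(out == "1234=ruby (ruby:1234)");
  assert(!ExpandRewrite("\\3", groups, &out));  // only two groups exist
}
```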
.
.
.SH AUTHOR
.rs
.sp
.nf
The C++ wrapper was contributed by Google Inc.
Copyright (c) 2007 Google Inc.
.fi
.
.
.SH REVISION
.rs
.sp
.nf
Last updated: 17 March 2009
.fi
