Contents of /code/trunk/doc/pcrecpp.3

Revision 286
Mon Dec 17 14:46:11 2007 UTC, by ph10
File size: 12374 byte(s)
Add .gz and .bz2 optional support to pcregrep.

1 nigel 79 .TH PCRECPP 3
2 nigel 77 .SH NAME
3     PCRE - Perl-compatible regular expressions.
4     .SH "SYNOPSIS OF C++ WRAPPER"
5     .rs
6     .sp
7     .B #include <pcrecpp.h>
8 ph10 99 .
9 nigel 77 .SH DESCRIPTION
10     .rs
11     .sp
12 nigel 81 The C++ wrapper for PCRE was provided by Google Inc. Some additional
13     functionality was added by Giuseppe Maxia. This brief man page was constructed
14     from the notes in the \fIpcrecpp.h\fP file, which should be consulted for
15     further details.
16 nigel 77 .
17     .
18     .SH "MATCHING INTERFACE"
19     .rs
20     .sp
21     The "FullMatch" operation checks that supplied text matches a supplied pattern
22     exactly. If pointer arguments are supplied, it copies matched sub-strings that
23     match sub-patterns into them.
24     .sp
25     Example: successful match
26     pcrecpp::RE re("h.*o");
27     re.FullMatch("hello");
28     .sp
29     Example: unsuccessful match (requires full match):
30     pcrecpp::RE re("e");
31     !re.FullMatch("hello");
32     .sp
33     Example: creating a temporary RE object:
34     pcrecpp::RE("h.*o").FullMatch("hello");
35     .sp
36     You can pass in a "const char*" or a "string" for "text". The examples below
37     tend to use a const char*. You can, as in the different examples above, store
38     the RE object explicitly in a variable or use a temporary RE object. The
39     examples below use one mode or the other arbitrarily. Either could correctly be
40     used for any of these examples.
41     .P
42     You must supply extra pointer arguments to extract matched subpieces.
43     .sp
44     Example: extracts "ruby" into "s" and 1234 into "i"
45     int i;
46     string s;
47     pcrecpp::RE re("(\e\ew+):(\e\ed+)");
48     re.FullMatch("ruby:1234", &s, &i);
49     .sp
50     Example: does not try to extract any extra sub-patterns
51     re.FullMatch("ruby:1234", &s);
52     .sp
53     Example: does not try to extract into NULL
54     re.FullMatch("ruby:1234", NULL, &i);
55     .sp
56     Example: integer overflow causes failure
57     !re.FullMatch("ruby:1234567891234", NULL, &i);
58     .sp
59     Example: fails because there aren't enough sub-patterns:
60     !pcrecpp::RE("\e\ew+:\e\ed+").FullMatch("ruby:1234", &s);
61     .sp
62     Example: fails because string cannot be stored in integer
63     !pcrecpp::RE("(.*)").FullMatch("ruby", &i);
64     .sp
65     The provided pointer arguments can be pointers to any scalar numeric
66     type, or one of:
67     .sp
68     string (matched piece is copied to string)
69     StringPiece (StringPiece is mutated to point to matched piece)
70     T (where "bool T::ParseFrom(const char*, int)" exists)
71     NULL (the corresponding matched sub-pattern is not copied)
72     .sp
73     The function returns true iff all of the following conditions are satisfied:
74     .sp
75     a. "text" matches "pattern" exactly;
76     .sp
77     b. The number of matched sub-patterns is >= number of supplied
78     pointers;
79     .sp
80     c. The "i"th argument has a suitable type for holding the
81     string captured as the "i"th sub-pattern. If you pass in
82 ph10 263 void * NULL for the "i"th argument, or a non-void * NULL
83 ph10 286 of the correct type, or pass fewer arguments than the
84 nigel 77 number of sub-patterns, the "i"th captured sub-pattern is
85     ignored.
86     .sp
87 nigel 93 CAVEAT: An optional sub-pattern that does not exist in the matched
88     string is assigned the empty string. Therefore, the following will
89     return false (because the empty string is not a valid number):
90     .sp
91     int number;
92 ph10 155 pcrecpp::RE("[a-z]+(\e\ed+)?").FullMatch("abc", &number);
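
The caveat above, and the integer-overflow example earlier, both come down to how a captured piece of text is converted to a number. The following is a minimal sketch of such a conversion using only the C++ standard library. The helper name parse_int_sketch is hypothetical and this is not the wrapper's actual code; it only illustrates why an empty capture, a non-numeric capture, and an out-of-range value all make the call return false.

```cpp
#include <cassert>
#include <cerrno>
#include <climits>
#include <cstdlib>
#include <string>

// Hypothetical helper, not part of pcrecpp: convert a captured substring
// to int, rejecting empty input, leftover characters, and overflow.
bool parse_int_sketch(const std::string& text, int* out) {
  if (text.empty()) return false;  // an unmatched optional group yields ""
  errno = 0;
  char* end = nullptr;
  long v = std::strtol(text.c_str(), &end, 10);
  // Characters left over after the digits mean the text is not a number.
  if (end != text.c_str() + text.size()) return false;
  // Reject values that do not fit in an int.
  if (errno == ERANGE || v < INT_MIN || v > INT_MAX) return false;
  *out = static_cast<int>(v);
  return true;
}
```

With this sketch, "1234" is stored, while "", "ruby", and "1234567891234" are all rejected, which mirrors the failures listed in the examples above.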
93 nigel 93 .sp
94 nigel 77 The matching interface supports at most 16 arguments per call.
95     If you need more, consider using the more general interface
96     \fBpcrecpp::RE::DoMatch\fP. See \fBpcrecpp.h\fP for the signature for
97     \fBDoMatch\fP.
98     .
99 nigel 93 .SH "QUOTING METACHARACTERS"
100     .rs
101     .sp
102     You can use the "QuoteMeta" operation to insert backslashes before all
103     potentially meaningful characters in a string. The returned string, used as a
104     regular expression, will exactly match the original string.
105     .sp
106     Example:
107     string quoted = RE::QuoteMeta(unquoted);
108     .sp
109     Note that it's legal to escape a character even if it has no special meaning in
110     a regular expression -- so this function does that. (This also makes it
111     identical to the perl function of the same name; see "perldoc -f quotemeta".)
112     For example, "1.5-2.0?" becomes "1\e.5\e-2\e.0\e?".
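
The quoting rule described above can be sketched in a few lines of standard C++. The function name quote_meta_sketch is hypothetical, and the handling of bytes with the high bit set (left unescaped so that multi-byte UTF-8 sequences stay intact) is an assumption of this sketch, not something this page states; consult pcrecpp.h for the real behaviour.

```cpp
#include <cassert>
#include <string>

// Hypothetical sketch of the rule: put a backslash before every ASCII
// character that is not a letter, a digit, or an underscore.
std::string quote_meta_sketch(const std::string& unquoted) {
  std::string result;
  result.reserve(unquoted.size() * 2);
  for (unsigned char c : unquoted) {
    bool is_word = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') ||
                   (c >= '0' && c <= '9') || c == '_';
    // Assumption: bytes >= 0x80 are copied unchanged.
    if (!is_word && c < 0x80) result += '\\';
    result += static_cast<char>(c);
  }
  return result;
}
```

Applied to "1.5-2.0?", the sketch produces the same escaped string as the example in the text.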
113     .
114 nigel 77 .SH "PARTIAL MATCHES"
115     .rs
116     .sp
117     You can use the "PartialMatch" operation when you want the pattern
118     to match any substring of the text.
119     .sp
120     Example: simple search for a string:
121     pcrecpp::RE("ell").PartialMatch("hello");
122     .sp
123     Example: find first number in a string:
124     int number;
125     pcrecpp::RE re("(\e\ed+)");
126     re.PartialMatch("x*100 + 20", &number);
127     assert(number == 100);
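
For readers more familiar with the standard library, the difference between "FullMatch" and "PartialMatch" is roughly the difference between std::regex_match and std::regex_search. The sketch below uses std::regex, not PCRE, so pattern syntax and matching details differ; the function names ending in _sketch are hypothetical and serve only to illustrate whole-text versus substring matching.

```cpp
#include <cassert>
#include <regex>
#include <string>

// The whole text must match the pattern (analogous to FullMatch).
bool full_match_sketch(const std::string& pattern, const std::string& text) {
  return std::regex_match(text, std::regex(pattern));
}

// Any substring of the text may match (analogous to PartialMatch).
bool partial_match_sketch(const std::string& pattern, const std::string& text) {
  return std::regex_search(text, std::regex(pattern));
}

// Store the first run of digits found in the text in *number.
// Note: std::stoi throws if the digits do not fit in an int.
bool first_number_sketch(const std::string& text, int* number) {
  std::smatch m;
  if (!std::regex_search(text, m, std::regex("(\\d+)"))) return false;
  *number = std::stoi(m[1].str());
  return true;
}
```

Here the pattern "e" fails against "hello" under full matching but would succeed under partial matching, and searching "x*100 + 20" for digits yields 100, as in the example above.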
128     .
129     .
130     .SH "UTF-8 AND THE MATCHING INTERFACE"
131     .rs
132     .sp
133     By default, pattern and text are plain text, one byte per character. The UTF8
134     flag, passed to the constructor, causes both pattern and string to be treated
135     as UTF-8 text, still a byte stream but potentially multiple bytes per
136     character. In practice, the text is likelier to be UTF-8 than the pattern, but
137     the match returned may depend on the UTF8 flag, so always use it when matching
138     UTF-8 text. For example, "." will match one byte normally but with UTF8 set may
139     match all the bytes of a multi-byte character.
140     .sp
141     Example:
142     pcrecpp::RE_Options options;
143     options.set_utf8();
144     pcrecpp::RE re(utf8_pattern, options);
145     re.FullMatch(utf8_string);
146     .sp
147     Example: using the convenience function UTF8():
148     pcrecpp::RE re(utf8_pattern, pcrecpp::UTF8());
149     re.FullMatch(utf8_string);
150     .sp
151     NOTE: The UTF8 flag is ignored if pcre was not configured with the
152     --enable-utf8 flag.
153     .
154     .
155 nigel 81 .SH "PASSING MODIFIERS TO THE REGULAR EXPRESSION ENGINE"
156     .rs
157     .sp
158     PCRE defines some modifiers to change the behavior of the regular expression
159     engine. The C++ wrapper defines an auxiliary class, RE_Options, as a vehicle to
160     pass such modifiers to a RE class. Currently, the following modifiers are
161     supported:
162     .sp
163        modifier              description               Perl corresponding
164     .sp
165        PCRE_CASELESS         case insensitive match      /i
166        PCRE_MULTILINE        multiple lines match        /m
167        PCRE_DOTALL           dot matches newlines        /s
168        PCRE_DOLLAR_ENDONLY   $ matches only at end       N/A
169        PCRE_EXTRA            strict escape parsing       N/A
170        PCRE_EXTENDED         ignore whitespaces          /x
171        PCRE_UTF8             handles UTF8 chars          built-in
172        PCRE_UNGREEDY         reverses * and *?           N/A
173        PCRE_NO_AUTO_CAPTURE  disables capturing parens   N/A (*)
174     .sp
175     (*) Both Perl and PCRE allow non capturing parentheses by means of the
176     "?:" modifier within the pattern itself. e.g. (?:ab|cd) does not
177     capture, while (ab|cd) does.
178     .P
179     For a full account of how each modifier works, please check the
180     PCRE API reference page.
181     .P
182     For each modifier, there are two member functions whose names are derived
183     from the modifier in lowercase, without the "PCRE_" prefix. For
184     instance, PCRE_CASELESS is handled by
185     .sp
186     bool caseless()
187     .sp
188     which returns true if the modifier is set, and
189     .sp
190     RE_Options & set_caseless(bool)
191     .sp
192 nigel 87 which sets or unsets the modifier. In addition, PCRE_EXTRA_MATCH_LIMIT can be
193 nigel 81 accessed through the \fBset_match_limit()\fR and \fBmatch_limit()\fR member
194     functions. Setting \fImatch_limit\fR to a non-zero value limits the amount of
195     work pcre may do, which keeps it from overflowing the stack or taking an
196     extremely long time to return a result. A value of 5000 is good enough to stop
197     stack blowup in a 2MB thread stack. Setting \fImatch_limit\fR to zero disables
198 nigel 87 match limiting. Alternatively, you can call \fBset_match_limit_recursion()\fP,
199     which uses PCRE_EXTRA_MATCH_LIMIT_RECURSION to limit how much PCRE
200     recurses. \fBmatch_limit()\fP limits the number of internal matching calls
201     PCRE makes; \fBmatch_limit_recursion()\fP limits the depth of internal
202     recursion, and therefore the amount of stack that is used.
203 nigel 81 .P
204     Normally, to pass one or more modifiers to a RE class, you declare
205     a \fIRE_Options\fR object, set the appropriate options, and pass this
206     object to a RE constructor. Example:
207     .sp
208     RE_Options opt;
209     opt.set_caseless(true);
210     if (RE("HELLO", opt).PartialMatch("hello world")) ...
211     .sp
212     RE_Options has two constructors. The default constructor takes no arguments and
213     creates a set of flags that are all off. The other takes an \fIoption_flags\fR
214     argument, which eases the transfer of legacy code from C programs.
215     This lets you do
216     .sp
217     RE(pattern,
218     RE_Options(PCRE_CASELESS|PCRE_MULTILINE)).PartialMatch(str);
219     .sp
220     However, new code is better off doing
221     .sp
222     RE(pattern,
223     RE_Options().set_caseless(true).set_multiline(true))
224     .PartialMatch(str);
225     .sp
226     If you are going to pass one of the most used modifiers, there are some
227     convenience functions that return a RE_Options object with the
228     appropriate modifier already set: \fBCASELESS()\fR, \fBUTF8()\fR,
229     \fBMULTILINE()\fR, \fBDOTALL()\fR, and \fBEXTENDED()\fR.
230     .P
231     If you need to set several options at once, and you don't want to go to
232     the trouble of declaring a RE_Options object and setting each option
233     separately, you can set them on the fly instead. You can chain
234     several \fBset_xxxxx()\fR member functions, since each of them returns a
235     reference to its class object. For example, to pass PCRE_CASELESS,
236     PCRE_EXTENDED, and PCRE_MULTILINE to a RE with one statement, you may write:
237     .sp
238     RE(" ^ xyz \e\es+ .* blah$",
239     RE_Options()
240     .set_caseless(true)
241     .set_extended(true)
242     .set_multiline(true)).PartialMatch(sometext);
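
Chaining works because each setter returns a reference to the object it was called on. The class below is a minimal sketch of that design, not the real RE_Options: the name OptionsSketch, the flags() accessor, and the bit values are all hypothetical, and the bit values are not claimed to equal the PCRE_* constants.

```cpp
#include <cassert>

// Hypothetical stand-in for an options class with chainable setters.
class OptionsSketch {
 public:
  // Illustrative bit values only.
  enum Flag { kCaseless = 1, kMultiline = 2, kExtended = 4 };

  OptionsSketch() : flags_(0) {}  // all flags off
  explicit OptionsSketch(int option_flags) : flags_(option_flags) {}

  bool caseless() const { return (flags_ & kCaseless) != 0; }
  bool multiline() const { return (flags_ & kMultiline) != 0; }
  bool extended() const { return (flags_ & kExtended) != 0; }

  // Each setter returns *this, which is what makes chaining possible.
  OptionsSketch& set_caseless(bool on) { return set(kCaseless, on); }
  OptionsSketch& set_multiline(bool on) { return set(kMultiline, on); }
  OptionsSketch& set_extended(bool on) { return set(kExtended, on); }

  int flags() const { return flags_; }

 private:
  OptionsSketch& set(int bit, bool on) {
    if (on) flags_ |= bit; else flags_ &= ~bit;
    return *this;
  }
  int flags_;
};
```

Because the setters return a reference, a temporary can be configured in one expression, such as OptionsSketch().set_caseless(true).set_extended(true), and passed straight to a constructor or function.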
243     .sp
244     .
245     .
246 nigel 77 .SH "SCANNING TEXT INCREMENTALLY"
247     .rs
248     .sp
249     The "Consume" operation may be useful if you want to repeatedly
250     match regular expressions at the front of a string and skip over
251     them as they match. This requires use of the "StringPiece" type,
252     which represents a sub-range of a real string. Like RE, StringPiece
253     is defined in the pcrecpp namespace.
254     .sp
255     Example: read lines of the form "var = value" from a string.
256     string contents = ...; // Fill string somehow
257     pcrecpp::StringPiece input(contents); // Wrap in a StringPiece
258    
259     string var;
260     int value;
261     pcrecpp::RE re("(\e\ew+) = (\e\ed+)\en");
262     while (re.Consume(&input, &var, &value)) {
263     ...;
264     }
265     .sp
266     Each successful call to "Consume" will set "var" and "value", and also
267     advance "input" so it points past the matched text.
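
The consume-and-advance idea can be sketched with the standard library alone. The code below uses std::regex with the match_continuous flag in place of PCRE, and an iterator in place of a StringPiece; the names consume_sketch and parse_assignments are hypothetical. It is an illustration of anchoring each match at the current position and then moving that position forward, not the wrapper's implementation.

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <utility>
#include <vector>

typedef std::string::const_iterator Iter;

// Hypothetical stand-in for Consume: match only at `pos` and, on success,
// store the two captures and advance `pos` past the matched text.
bool consume_sketch(Iter& pos, Iter end, const std::regex& re,
                    std::string* var, int* value) {
  std::smatch m;
  if (!std::regex_search(pos, end, m, re,
                         std::regex_constants::match_continuous)) {
    return false;
  }
  *var = m[1].str();
  *value = std::stoi(m[2].str());  // throws if the digits overflow an int
  pos = m[0].second;               // skip over what was just matched
  return true;
}

// Read consecutive "var = value" lines from the front of `contents`.
std::vector<std::pair<std::string, int> > parse_assignments(
    const std::string& contents) {
  const std::regex re("(\\w+) = (\\d+)\n");
  std::vector<std::pair<std::string, int> > result;
  Iter pos = contents.begin();
  std::string var;
  int value = 0;
  while (consume_sketch(pos, contents.end(), re, &var, &value)) {
    result.push_back(std::make_pair(var, value));
  }
  return result;
}
```

The loop stops at the first position where the pattern no longer matches at the front of the remaining text, so trailing input that is not of the form "var = value" is left unread.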
268     .P
269     The "FindAndConsume" operation is similar to "Consume" but does not
270     anchor your match at the beginning of the string. For example, you
271     could extract all words from a string by repeatedly calling
272     .sp
273     pcrecpp::RE("(\e\ew+)").FindAndConsume(&input, &word)
274     .
275     .
276     .SH "PARSING HEX/OCTAL/C-RADIX NUMBERS"
277     .rs
278     .sp
279     By default, if you pass a pointer to a numeric value, the
280     corresponding text is interpreted as a base-10 number. You can
281     instead wrap the pointer with a call to one of the operators Hex(),
282     Octal(), or CRadix() to interpret the text in another base. The
283     CRadix operator interprets C-style "0" (base-8) and "0x" (base-16)
284     prefixes, but defaults to base-10.
285     .sp
286     Example:
287     int a, b, c, d;
288     pcrecpp::RE re("(.*) (.*) (.*) (.*)");
289     re.FullMatch("100 40 0100 0x40",
290     pcrecpp::Octal(&a), pcrecpp::Hex(&b),
291     pcrecpp::CRadix(&c), pcrecpp::CRadix(&d));
292     .sp
293     will leave 64 in a, b, c, and d.
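
The arithmetic behind this example can be checked with std::strtol, whose base argument behaves much like the three operators: base 8 for octal, base 16 for hex, and base 0 for C-style prefix detection. The helper name parse_radix_sketch is hypothetical, and this sketch is not how the wrapper itself is implemented.

```cpp
#include <cassert>
#include <cerrno>
#include <cstdlib>
#include <string>

// Hypothetical helper: parse all of `text` as an integer in `base`.
// A base of 0 means "decide from the prefix": 0x is hex, a leading 0 is
// octal, anything else is decimal.
bool parse_radix_sketch(const std::string& text, int base, long* out) {
  if (text.empty()) return false;
  errno = 0;
  char* end = nullptr;
  long v = std::strtol(text.c_str(), &end, base);
  if (errno == ERANGE || end != text.c_str() + text.size()) return false;
  *out = v;
  return true;
}
```

Parsing "100" in base 8, "40" in base 16, and "0100" and "0x40" in base 0 each gives 64, while "100" in base 0 stays decimal.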
294     .
295     .
296     .SH "REPLACING PARTS OF STRINGS"
297     .rs
298     .sp
299     You can replace the first match of "pattern" in "str" with "rewrite".
300     Within "rewrite", backslash-escaped digits (\e1 to \e9) can be
301     used to insert the text matching the corresponding parenthesized group
302     from the pattern. \e0 in "rewrite" refers to the entire matching
303     text. For example:
304     .sp
305     string s = "yabba dabba doo";
306     pcrecpp::RE("b+").Replace("d", &s);
307     .sp
308     will leave "s" containing "yada dabba doo". The result is true if the pattern
309     matches and a replacement occurs, false otherwise.
310     .P
311     \fBGlobalReplace\fP is like \fBReplace\fP except that it replaces all
312     occurrences of the pattern in the string with the rewrite. Replacements are
313     not subject to re-matching. For example:
314     .sp
315     string s = "yabba dabba doo";
316     pcrecpp::RE("b+").GlobalReplace("d", &s);
317     .sp
318     will leave "s" containing "yada dada doo". It returns the number of
319     replacements made.
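
The first-match versus all-matches distinction can be reproduced with std::regex_replace, which replaces every match by default and only the first when given format_first_only. This sketch uses std::regex rather than PCRE, and the function names are hypothetical. Note one difference: in std::regex the rewrite string refers to groups as $1 and to the whole match as $&, not \e1 and \e0, and the functions return a new string instead of modifying the input in place.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Replace only the first match of `pattern` in `s` (analogous to Replace).
std::string replace_first_sketch(const std::string& s,
                                 const std::string& pattern,
                                 const std::string& rewrite) {
  return std::regex_replace(s, std::regex(pattern), rewrite,
                            std::regex_constants::format_first_only);
}

// Replace every match of `pattern` in `s` (analogous to GlobalReplace).
std::string replace_all_sketch(const std::string& s,
                               const std::string& pattern,
                               const std::string& rewrite) {
  return std::regex_replace(s, std::regex(pattern), rewrite);
}
```

Both results agree with the two examples above when applied to "yabba dabba doo" with the pattern "b+" and the rewrite "d".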
320     .P
321     \fBExtract\fP is like \fBReplace\fP, except that if the pattern matches,
322     "rewrite" is copied into "out" (an additional argument) with substitutions.
323     The non-matching portions of "text" are ignored. Returns true iff a match
324     occurred and the extraction happened successfully; if no match occurs, the
325     string is left unaffected.
326     .
327     .
328     .SH AUTHOR
329     .rs
330     .sp
331 ph10 99 .nf
332 nigel 77 The C++ wrapper was contributed by Google Inc.
333 ph10 117 Copyright (c) 2007 Google Inc.
334 ph10 99 .fi
335     .
336     .
337     .SH REVISION
338     .rs
339     .sp
340     .nf
341 ph10 263 Last updated: 12 November 2007
342 ph10 99 .fi
