Revision 1055, Tue Oct 16 15:53:30 2012 UTC, by chpe
pcre32: Add 32-bit library

Create libpcre32 that operates on 32-bit characters (UTF-32).

This turned out to be surprisingly simple after the UTF-16 support
was introduced; mostly just extra ifdefs and adjusting and adding
some tests.
.TH PCRECPP 3 "08 January 2012" "PCRE 8.30"
.SH NAME
PCRE - Perl-compatible regular expressions.
.SH "SYNOPSIS OF C++ WRAPPER"
.rs
.sp
.B #include <pcrecpp.h>
.
.SH DESCRIPTION
.rs
.sp
The C++ wrapper for PCRE was provided by Google Inc. Some additional
functionality was added by Giuseppe Maxia. This brief man page was constructed
from the notes in the \fIpcrecpp.h\fP file, which should be consulted for
further details. Note that the C++ wrapper supports only the original 8-bit
PCRE library. There is no 16-bit or 32-bit support at present.
.
.
.SH "MATCHING INTERFACE"
.rs
.sp
The "FullMatch" operation checks that supplied text matches a supplied pattern
exactly. If pointer arguments are supplied, it copies matched sub-strings that
match sub-patterns into them.
.sp
  Example: successful match
     pcrecpp::RE re("h.*o");
     re.FullMatch("hello");
.sp
  Example: unsuccessful match (requires full match):
     pcrecpp::RE re("e");
     !re.FullMatch("hello");
.sp
  Example: creating a temporary RE object:
     pcrecpp::RE("h.*o").FullMatch("hello");
.sp
You can pass in a "const char*" or a "string" for "text". The examples below
tend to use a const char*. You can, as in the examples above, store
the RE object explicitly in a variable or use a temporary RE object. The
examples below use one mode or the other arbitrarily. Either could correctly be
used for any of these examples.
.P
You must supply extra pointer arguments to extract matched subpieces.
.sp
  Example: extracts "ruby" into "s" and 1234 into "i"
     int i;
     string s;
     pcrecpp::RE re("(\e\ew+):(\e\ed+)");
     re.FullMatch("ruby:1234", &s, &i);
.sp
  Example: does not try to extract any extra sub-patterns
     re.FullMatch("ruby:1234", &s);
.sp
  Example: does not try to extract into NULL
     re.FullMatch("ruby:1234", NULL, &i);
.sp
  Example: integer overflow causes failure
     !re.FullMatch("ruby:1234567891234", NULL, &i);
.sp
  Example: fails because there aren't enough sub-patterns:
     !pcrecpp::RE("\e\ew+:\e\ed+").FullMatch("ruby:1234", &s);
.sp
  Example: fails because string cannot be stored in integer
     !pcrecpp::RE("(.*)").FullMatch("ruby", &i);
.sp
The provided pointer arguments can be pointers to any scalar numeric
type, or one of:
.sp
   string        (matched piece is copied to string)
   StringPiece   (StringPiece is mutated to point to matched piece)
   T             (where "bool T::ParseFrom(const char*, int)" exists)
   NULL          (the corresponding matched sub-pattern is not copied)
.sp
The function returns true if and only if all of the following conditions are
satisfied:
.sp
  a. "text" matches "pattern" exactly;
.sp
  b. The number of matched sub-patterns is >= number of supplied
     pointers;
.sp
  c. The "i"th argument has a suitable type for holding the
     string captured as the "i"th sub-pattern. If you pass in
     void * NULL for the "i"th argument, or a non-void * NULL
     of the correct type, or pass fewer arguments than the
     number of sub-patterns, the "i"th captured sub-pattern is
     ignored.
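.P
The three conditions can be illustrated without PCRE at all. The following is
a minimal sketch that uses only the C++ standard library (std::regex) to mimic
a two-capture FullMatch call. The helper names parse_int and full_match_2 are
hypothetical and are not part of pcrecpp; the real wrapper may differ in
details such as which number formats it accepts.

```cpp
#include <cassert>
#include <cctype>
#include <cerrno>
#include <climits>
#include <cstdlib>
#include <regex>
#include <string>

// Hypothetical helper: parse the whole of `text` as a base-10 int.
// Rejects empty input, leading whitespace, trailing characters, and
// values that do not fit in an int.
bool parse_int(const std::string& text, int* out) {
  if (text.empty() || std::isspace(static_cast<unsigned char>(text[0]))) {
    return false;
  }
  errno = 0;
  char* end = nullptr;
  const long value = std::strtol(text.c_str(), &end, 10);
  if (end == text.c_str() || *end != '\0') return false;
  if (errno == ERANGE || value < INT_MIN || value > INT_MAX) return false;
  *out = static_cast<int>(value);
  return true;
}

// Hypothetical stand-in for re.FullMatch(text, &s, &i) with two captures.
// Either pointer may be null, in which case that capture is skipped.
bool full_match_2(const std::string& pattern, const std::string& text,
                  std::string* s, int* i) {
  const std::regex re(pattern);
  if (re.mark_count() < 2) return false;             // condition (b)
  std::smatch m;
  if (!std::regex_match(text, m, re)) return false;  // condition (a)
  if (s != nullptr) *s = m[1].str();
  if (i != nullptr && !parse_int(m[2].str(), i)) return false;  // condition (c)
  return true;
}
```

With the pattern "(\e\ew+):(\e\ed+)" (written "(\e\e\e\ew+):(\e\e\e\ed+)" in C++
source), this sketch accepts "ruby:1234" and rejects "ruby:1234567891234"
because the number does not fit in an int, in line with the overflow example
above.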
.sp
CAVEAT: An optional sub-pattern that does not exist in the matched
string is assigned the empty string. Therefore, the following will
return false (because the empty string is not a valid number):
.sp
   int number;
   pcrecpp::RE("[a-z]+(\e\ed+)?").FullMatch("abc", &number);
.sp
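One way around this caveat is to treat the optional piece as text first and
convert it only when it is present. With pcrecpp that would presumably mean
capturing into a string and testing whether it is empty; the sketch below
shows the same idea with std::regex only. The function name
match_optional_number is hypothetical, and the conversion deliberately omits
overflow checking.

```cpp
#include <cassert>
#include <cstdlib>
#include <regex>
#include <string>

// Returns true if `text` matches letters followed by an optional number.
// *has_number reports whether the optional group took part in the match;
// *number is written only in that case (no overflow check in this sketch).
bool match_optional_number(const std::string& text, bool* has_number,
                           int* number) {
  static const std::regex re("[a-z]+(\\d+)?");
  std::smatch m;
  if (!std::regex_match(text, m, re)) return false;
  *has_number = m[1].matched;
  if (*has_number) *number = std::atoi(m[1].str().c_str());
  return true;
}
```

Here "abc" matches with has_number set to false, instead of the whole match
being reported as a failure.
.sp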
The matching interface supports at most 16 arguments per call.
If you need more, consider using the more general interface
\fBpcrecpp::RE::DoMatch\fP. See \fBpcrecpp.h\fP for the signature of
\fBDoMatch\fP.
.P
NOTE: Do not use \fBno_arg\fP, which is used internally to mark the end of a
list of optional arguments, as a placeholder for missing arguments, as this can
lead to segfaults.
.
.
.SH "QUOTING METACHARACTERS"
.rs
.sp
You can use the "QuoteMeta" operation to insert backslashes before all
potentially meaningful characters in a string. The returned string, used as a
regular expression, will exactly match the original string.
.sp
  Example:
     string quoted = RE::QuoteMeta(unquoted);
.sp
Note that it's legal to escape a character even if it has no special meaning in
a regular expression -- so this function does that. (This also makes it
identical to the Perl function of the same name; see "perldoc -f quotemeta".)
For example, "1.5-2.0?" becomes "1\e.5\e-2\e.0\e?".
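.P
The quoting rule described above can be sketched in a few lines. This is an
ASCII-oriented approximation written for illustration, not the code in
pcrecpp: the name quote_meta is hypothetical, and the real QuoteMeta may treat
NUL bytes and non-ASCII bytes differently.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Hypothetical sketch: put a backslash before every character that is not
// an ASCII letter, digit, or underscore, whether or not it is special.
std::string quote_meta(const std::string& unquoted) {
  std::string result;
  result.reserve(unquoted.size() * 2);
  for (unsigned char c : unquoted) {
    if (!std::isalnum(c) && c != '_') result += '\\';
    result += static_cast<char>(c);
  }
  return result;
}
```

Applied to "1.5-2.0?", the sketch produces the same result as the example
above.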
.
.SH "PARTIAL MATCHES"
.rs
.sp
You can use the "PartialMatch" operation when you want the pattern
to match any substring of the text.
.sp
  Example: simple search for a string:
     pcrecpp::RE("ell").PartialMatch("hello");
.sp
  Example: find first number in a string:
     int number;
     pcrecpp::RE re("(\e\ed+)");
     re.PartialMatch("x*100 + 20", &number);
     assert(number == 100);
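.P
The difference between a full and a partial match is the same as the
difference between std::regex_match and std::regex_search in the C++ standard
library. The sketch below uses those two functions to reproduce the examples
above; the helper names are hypothetical and nothing here calls pcrecpp.

```cpp
#include <cassert>
#include <regex>
#include <stdexcept>
#include <string>

// True if `pattern` matches the whole of `text` (FullMatch-like).
bool matches_fully(const std::string& pattern, const std::string& text) {
  return std::regex_match(text, std::regex(pattern));
}

// True if `pattern` matches some substring of `text` (PartialMatch-like).
bool matches_somewhere(const std::string& pattern, const std::string& text) {
  return std::regex_search(text, std::regex(pattern));
}

// Stores the first run of digits in `text` into *number.
// Returns false if there is none or if it does not fit in an int.
bool first_number(const std::string& text, int* number) {
  static const std::regex re("(\\d+)");
  std::smatch m;
  if (!std::regex_search(text, m, re)) return false;
  try {
    *number = std::stoi(m[1].str());
  } catch (const std::out_of_range&) {
    return false;
  }
  return true;
}
```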
.
.
.SH "UTF-8 AND THE MATCHING INTERFACE"
.rs
.sp
By default, pattern and text are plain text, one byte per character. The UTF8
flag, passed to the constructor, causes both pattern and string to be treated
as UTF-8 text, still a byte stream but potentially multiple bytes per
character. In practice, the text is likelier to be UTF-8 than the pattern, but
the match returned may depend on the UTF8 flag, so always use it when matching
UTF-8 text. For example, "." will match one byte normally but with UTF8 set may
match a whole multi-byte character (up to four bytes in standard UTF-8).
.sp
  Example:
     pcrecpp::RE_Options options;
     options.set_utf8();
     pcrecpp::RE re(utf8_pattern, options);
     re.FullMatch(utf8_string);
.sp
  Example: using the convenience function UTF8():
     pcrecpp::RE re(utf8_pattern, pcrecpp::UTF8());
     re.FullMatch(utf8_string);
.sp
NOTE: The UTF8 flag is ignored if PCRE was not configured with the
--enable-utf8 flag.
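.P
To see why the flag changes what "." matches, it helps to look at how a UTF-8
byte stream divides into characters. The sketch below is independent of PCRE;
it assumes standard UTF-8 (RFC 3629, one to four bytes per character), and the
function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Number of bytes in the UTF-8 sequence that starts with `lead`,
// or 0 if `lead` is a continuation byte or not a valid lead byte.
std::size_t utf8_char_length(unsigned char lead) {
  if (lead < 0x80) return 1;
  if ((lead & 0xE0) == 0xC0) return 2;
  if ((lead & 0xF0) == 0xE0) return 3;
  if ((lead & 0xF8) == 0xF0) return 4;
  return 0;
}

// Counts characters rather than bytes. A stray byte is counted as one unit.
std::size_t utf8_count_chars(const std::string& s) {
  std::size_t count = 0;
  for (std::size_t i = 0; i < s.size();) {
    std::size_t n = utf8_char_length(static_cast<unsigned char>(s[i]));
    if (n == 0) n = 1;
    i += n;
    ++count;
  }
  return count;
}
```

A string such as "h\exC3\exA9llo" holds six bytes but five characters, so a
byte-oriented "." would stop in the middle of the second character.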
.
.
.SH "PASSING MODIFIERS TO THE REGULAR EXPRESSION ENGINE"
.rs
.sp
PCRE defines some modifiers to change the behavior of the regular expression
engine. The C++ wrapper defines an auxiliary class, RE_Options, as a vehicle to
pass such modifiers to a RE class. Currently, the following modifiers are
supported:
.sp
   modifier              description               Perl corresponding
.sp
   PCRE_CASELESS         case insensitive match      /i
   PCRE_MULTILINE        multiple lines match        /m
   PCRE_DOTALL           dot matches newlines        /s
   PCRE_DOLLAR_ENDONLY   $ matches only at end       N/A
   PCRE_EXTRA            strict escape parsing       N/A
   PCRE_EXTENDED         ignore white spaces         /x
   PCRE_UTF8             handles UTF8 chars          built-in
   PCRE_UNGREEDY         reverses * and *?           N/A
   PCRE_NO_AUTO_CAPTURE  disables capturing parens   N/A (*)
.sp
(*) Both Perl and PCRE allow non-capturing parentheses by means of the
"?:" modifier within the pattern itself. For example, (?:ab|cd) does not
capture, while (ab|cd) does.
.P
For a full account of how each modifier works, please check the
PCRE API reference page.
.P
For each modifier, there are two member functions whose names are derived
from the modifier in lowercase, without the "PCRE_" prefix. For
instance, PCRE_CASELESS is handled by
.sp
  bool caseless()
.sp
which returns true if the modifier is set, and
.sp
  RE_Options & set_caseless(bool)
.sp
which sets or unsets the modifier. Moreover, PCRE_EXTRA_MATCH_LIMIT can be
accessed through the \fBset_match_limit()\fP and \fBmatch_limit()\fP member
functions. Setting \fImatch_limit\fP to a non-zero value will limit the
execution of PCRE to keep it from doing bad things like blowing the stack or
taking an eternity to return a result. A value of 5000 is good enough to stop
stack blowup in a 2MB thread stack. Setting \fImatch_limit\fP to zero disables
match limiting. Alternatively, you can call \fBset_match_limit_recursion()\fP,
which uses PCRE_EXTRA_MATCH_LIMIT_RECURSION to limit how much PCRE
recurses. \fBmatch_limit()\fP limits the number of times PCRE's internal
matching function is called during a match;
\fBmatch_limit_recursion()\fP limits the depth of internal recursion, and
therefore the amount of stack that is used.
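.P
The effect of a match limit is easiest to see on a toy backtracking matcher.
The sketch below is not PCRE's engine: it is a small recursive matcher for
literals, "." and postfix "*", written only to show how a call budget turns a
runaway search into an early failure. The class name LimitedMatcher and its
members are hypothetical, and it assumes that the limit counts calls to the
matching function, with zero meaning "no limit" as described above.

```cpp
#include <cassert>
#include <string>

// Toy anchored matcher with a budget on calls to match_here().
struct LimitedMatcher {
  unsigned long limit;   // 0 disables the limit
  unsigned long calls;   // calls made so far
  bool exceeded;         // set once the budget is exhausted

  explicit LimitedMatcher(unsigned long match_limit)
      : limit(match_limit), calls(0), exceeded(false) {}

  // True if `pat` matches all of `text`. Gives up (returning false)
  // as soon as the call budget is exceeded.
  bool match_here(const char* pat, const char* text) {
    if (exceeded) return false;
    ++calls;
    if (limit != 0 && calls > limit) {
      exceeded = true;
      return false;
    }
    if (pat[0] == '\0') return text[0] == '\0';
    const bool first =
        text[0] != '\0' && (pat[0] == '.' || pat[0] == text[0]);
    if (pat[1] == '*') {
      // Try zero repetitions first, then one more repetition.
      return match_here(pat + 2, text) ||
             (first && match_here(pat, text + 1));
    }
    return first && match_here(pat + 1, text + 1);
  }
};
```

With this matcher, "h.*o" against "hello" succeeds after a handful of calls,
while "a*a*a*a*a*b" against twenty "a" characters needs far more than 1000
calls to fail on its own; with a limit of 1000 it stops early and sets
exceeded.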
.P
Normally, to pass one or more modifiers to a RE class, you declare
a \fIRE_Options\fP object, set the appropriate options, and pass this
object to a RE constructor. Example:
.sp
   RE_Options opt;
   opt.set_caseless(true);
   if (RE("HELLO", opt).PartialMatch("hello world")) ...
.sp
RE_Options has two constructors. The default constructor takes no arguments and
creates a set of flags that are off by default. The second constructor takes an
\fIoption_flags\fP argument, which facilitates the transfer of legacy code from
C programs. This lets you do
.sp
   RE(pattern,
     RE_Options(PCRE_CASELESS|PCRE_MULTILINE)).PartialMatch(str);
.sp
However, new code is better off doing
.sp
   RE(pattern,
     RE_Options().set_caseless(true).set_multiline(true))
       .PartialMatch(str);
.sp
If you are going to pass one of the most used modifiers, there are some
convenience functions that return a RE_Options class with the
appropriate modifier already set: \fBCASELESS()\fP, \fBUTF8()\fP,
\fBMULTILINE()\fP, \fBDOTALL()\fP, and \fBEXTENDED()\fP.
.P
If you need to set several options at once, and you don't want to go to
the trouble of declaring a RE_Options object and setting several options, there
is a parallel method that gives you this ability on the fly. You can chain
several \fBset_xxxxx()\fP member functions, since each of them returns a
reference to its class object. For example, to pass PCRE_CASELESS,
PCRE_EXTENDED, and PCRE_MULTILINE to a RE with one statement, you may write:
.sp
   RE(" ^ xyz \e\es+ .* blah$",
     RE_Options()
       .set_caseless(true)
       .set_extended(true)
       .set_multiline(true)).PartialMatch(sometext);
.sp
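Chaining works because every setter returns a reference to the object it was
called on. The sketch below shows that idiom on a small stand-in class. The
class name Options and the flag values are hypothetical; they are not the
RE_Options class or the PCRE_* constants, only an illustration of the two
constructors and the getter/setter pairs described above.

```cpp
#include <cassert>

// Hypothetical stand-in for an options class built on bit flags.
class Options {
 public:
  enum Flag { kCaseless = 1, kMultiline = 2, kExtended = 4 };

  Options() : flags_(0) {}                       // all flags off
  explicit Options(int flags) : flags_(flags) {} // legacy-style bit mask

  bool caseless() const { return (flags_ & kCaseless) != 0; }
  bool multiline() const { return (flags_ & kMultiline) != 0; }
  bool extended() const { return (flags_ & kExtended) != 0; }

  // Each setter returns *this so that calls can be chained.
  Options& set_caseless(bool on) { return set(kCaseless, on); }
  Options& set_multiline(bool on) { return set(kMultiline, on); }
  Options& set_extended(bool on) { return set(kExtended, on); }

  int all_options() const { return flags_; }

 private:
  Options& set(int flag, bool on) {
    if (on) {
      flags_ |= flag;
    } else {
      flags_ &= ~flag;
    }
    return *this;
  }

  int flags_;
};
```

In this sketch, Options().set_caseless(true).set_multiline(true) and
Options(Options::kCaseless | Options::kMultiline) describe the same set of
flags.
.sp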
.
.
.SH "SCANNING TEXT INCREMENTALLY"
.rs
.sp
The "Consume" operation may be useful if you want to repeatedly
match regular expressions at the front of a string and skip over
them as they match. This requires use of the "StringPiece" type,
which represents a sub-range of a real string. Like RE, StringPiece
is defined in the pcrecpp namespace.
.sp
  Example: read lines of the form "var = value" from a string.
     string contents = ...;                 // Fill string somehow
     pcrecpp::StringPiece input(contents);  // Wrap in a StringPiece
.sp
     string var;
     int value;
     pcrecpp::RE re("(\e\ew+) = (\e\ed+)\en");
     while (re.Consume(&input, &var, &value)) {
       ...;
     }
.sp
Each successful call to "Consume" will set "var/value", and also
advance "input" so it points past the matched text.
.P
The "FindAndConsume" operation is similar to "Consume" but does not
anchor your match at the beginning of the string. For example, you
could extract all words from a string by repeatedly calling
.sp
  pcrecpp::RE("(\e\ew+)").FindAndConsume(&input, &word)
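.P
The two scanning styles can be sketched with std::regex alone: an anchored
search that must start at the current position, and an unanchored one that may
skip ahead. In the sketch below, Cursor is a hypothetical stand-in for
StringPiece, and consume and find_and_consume are hypothetical functions that
imitate the operations described above; none of this calls pcrecpp.

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for StringPiece: the text not yet consumed.
struct Cursor {
  std::string::const_iterator begin;
  std::string::const_iterator end;
};

// Like Consume: the match must start exactly at the cursor.
bool consume(const std::regex& re, Cursor* in, std::smatch* m) {
  if (!std::regex_search(in->begin, in->end, *m, re,
                         std::regex_constants::match_continuous)) {
    return false;
  }
  in->begin = (*m)[0].second;  // advance past the matched text
  return true;
}

// Like FindAndConsume: the match may start anywhere after the cursor.
bool find_and_consume(const std::regex& re, Cursor* in, std::smatch* m) {
  if (!std::regex_search(in->begin, in->end, *m, re)) return false;
  in->begin = (*m)[0].second;
  return true;
}

// Reads leading "var = value" lines; stops at the first line that differs.
// Values are assumed to fit in an int (std::stoi throws otherwise).
std::vector<std::pair<std::string, int> > parse_assignments(
    const std::string& contents) {
  static const std::regex re("(\\w+) = (\\d+)\n");
  Cursor in = {contents.begin(), contents.end()};
  std::vector<std::pair<std::string, int> > result;
  std::smatch m;
  while (consume(re, &in, &m)) {
    result.push_back(std::make_pair(m[1].str(), std::stoi(m[2].str())));
  }
  return result;
}

// Extracts every word, skipping whatever lies between words.
std::vector<std::string> words(const std::string& text) {
  static const std::regex re("(\\w+)");
  Cursor in = {text.begin(), text.end()};
  std::vector<std::string> result;
  std::smatch m;
  while (find_and_consume(re, &in, &m)) result.push_back(m[1].str());
  return result;
}
```

Because consume is anchored, parse_assignments stops at the first line that is
not of the expected form, whereas words keeps going until no word is left.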
.
.
.SH "PARSING HEX/OCTAL/C-RADIX NUMBERS"
.rs
.sp
By default, if you pass a pointer to a numeric value, the
corresponding text is interpreted as a base-10 number. You can
instead wrap the pointer with a call to one of the operators Hex(),
Octal(), or CRadix() to interpret the text in another base. The
CRadix operator interprets C-style "0" (base-8) and "0x" (base-16)
prefixes, but defaults to base-10.
.sp
  Example:
     int a, b, c, d;
     pcrecpp::RE re("(.*) (.*) (.*) (.*)");
     re.FullMatch("100 40 0100 0x40",
                  pcrecpp::Octal(&a), pcrecpp::Hex(&b),
                  pcrecpp::CRadix(&c), pcrecpp::CRadix(&d));
.sp
will leave 64 in a, b, c, and d.
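.P
The same four conversions can be reproduced with the standard C function
strtol, whose base argument accepts 8, 16, and 0 (meaning "detect a C-style
prefix"). Whether pcrecpp uses strtol internally is an assumption here; the
sketch only shows that the arithmetic in the example holds. The helper name
parse_long is hypothetical.

```cpp
#include <cassert>
#include <cerrno>
#include <cstdlib>

// Parses all of `text` in the given base (0 = C-style prefix detection).
// Returns false on empty input, trailing characters, or overflow.
bool parse_long(const char* text, int base, long* out) {
  errno = 0;
  char* end = nullptr;
  const long value = std::strtol(text, &end, base);
  if (end == text || *end != '\0' || errno == ERANGE) return false;
  *out = value;
  return true;
}
```

Here parse_long("100", 8, &v), parse_long("40", 16, &v),
parse_long("0100", 0, &v) and parse_long("0x40", 0, &v) all store 64, while
parse_long("100", 0, &v) falls back to base 10 and stores 100.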
.
.
.SH "REPLACING PARTS OF STRINGS"
.rs
.sp
You can replace the first match of "pattern" in "str" with "rewrite".
Within "rewrite", backslash-escaped digits (\e1 to \e9) can be
used to insert the text matching the corresponding parenthesized group
from the pattern. \e0 in "rewrite" refers to the entire matching
text. For example:
.sp
  string s = "yabba dabba doo";
  pcrecpp::RE("b+").Replace("d", &s);
.sp
will leave "s" containing "yada dabba doo". The result is true if the pattern
matches and a replacement occurs, false otherwise.
.P
\fBGlobalReplace\fP is like \fBReplace\fP except that it replaces all
occurrences of the pattern in the string with the rewrite. Replacements are
not subject to re-matching. For example:
.sp
  string s = "yabba dabba doo";
  pcrecpp::RE("b+").GlobalReplace("d", &s);
.sp
will leave "s" containing "yada dada doo". It returns the number of
replacements made.
.P
\fBExtract\fP is like \fBReplace\fP, except that if the pattern matches,
"rewrite" is copied into "out" (an additional argument) with substitutions.
The non-matching portions of "text" are ignored. It returns true if and only
if a match occurred and the extraction happened successfully; if no match
occurs, the string is left unaffected.
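.P
The three operations have close analogues in the C++ standard library, which
the sketch below uses to reproduce the examples above. Two differences should
be kept in mind: std::regex writes group references in the rewrite string as
$1 to $9 instead of \e1 to \e9, and the function names replace_first,
global_replace, and extract are hypothetical stand-ins, not pcrecpp calls.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <regex>
#include <string>

// Replace-like: rewrites only the first match. Returns false if none.
bool replace_first(const std::regex& re, const std::string& rewrite,
                   std::string* s) {
  if (!std::regex_search(*s, re)) return false;
  *s = std::regex_replace(*s, re, rewrite,
                          std::regex_constants::format_first_only);
  return true;
}

// GlobalReplace-like: rewrites every match and returns how many there were.
int global_replace(const std::regex& re, const std::string& rewrite,
                   std::string* s) {
  const std::string& text = *s;
  const std::ptrdiff_t count = std::distance(
      std::sregex_iterator(text.begin(), text.end(), re),
      std::sregex_iterator());
  *s = std::regex_replace(text, re, rewrite);
  return static_cast<int>(count);
}

// Extract-like: formats `rewrite` from the first match into *out and
// ignores the rest of `text`. Leaves *out untouched if nothing matches.
bool extract(const std::regex& re, const std::string& rewrite,
             const std::string& text, std::string* out) {
  std::smatch m;
  if (!std::regex_search(text, m, re)) return false;
  *out = m.format(rewrite);
  return true;
}
```

With the pattern "b+" and the rewrite "d", replace_first turns
"yabba dabba doo" into "yada dabba doo", and global_replace turns it into
"yada dada doo" and returns 2.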
.
.
.SH AUTHOR
.rs
.sp
.nf
The C++ wrapper was contributed by Google Inc.
Copyright (c) 2007 Google Inc.
.fi
.
.
.SH REVISION
.rs
.sp
.nf
Last updated: 08 January 2012
.fi
