
Contents of /code/trunk/doc/html/pcrecpp.html


Revision 654
Tue Aug 2 11:00:40 2011 UTC (3 years, 8 months ago) by ph10
File MIME type: text/html
File size: 14307 byte(s)
Documentation and general text tidies in preparation for test release.

1 nigel 77 <html>
2     <head>
3     <title>pcrecpp specification</title>
4     </head>
5     <body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
6     <h1>pcrecpp man page</h1>
7     <p>
8     Return to the <a href="index.html">PCRE index page</a>.
9     </p>
10 ph10 111 <p>
11 nigel 77 This page is part of the PCRE HTML documentation. It was generated automatically
12     from the original man page. If there is any nonsense in it, please consult the
13     man page, in case the conversion went wrong.
14 ph10 111 <br>
15 nigel 77 <ul>
16     <li><a name="TOC1" href="#SEC1">SYNOPSIS OF C++ WRAPPER</a>
17     <li><a name="TOC2" href="#SEC2">DESCRIPTION</a>
18     <li><a name="TOC3" href="#SEC3">MATCHING INTERFACE</a>
19 nigel 93 <li><a name="TOC4" href="#SEC4">QUOTING METACHARACTERS</a>
20     <li><a name="TOC5" href="#SEC5">PARTIAL MATCHES</a>
21     <li><a name="TOC6" href="#SEC6">UTF-8 AND THE MATCHING INTERFACE</a>
22     <li><a name="TOC7" href="#SEC7">PASSING MODIFIERS TO THE REGULAR EXPRESSION ENGINE</a>
23     <li><a name="TOC8" href="#SEC8">SCANNING TEXT INCREMENTALLY</a>
24     <li><a name="TOC9" href="#SEC9">PARSING HEX/OCTAL/C-RADIX NUMBERS</a>
25     <li><a name="TOC10" href="#SEC10">REPLACING PARTS OF STRINGS</a>
26     <li><a name="TOC11" href="#SEC11">AUTHOR</a>
27 ph10 99 <li><a name="TOC12" href="#SEC12">REVISION</a>
28 nigel 77 </ul>
29     <br><a name="SEC1" href="#TOC1">SYNOPSIS OF C++ WRAPPER</a><br>
30     <P>
31     <b>#include &#60;pcrecpp.h&#62;</b>
32     </P>
33     <br><a name="SEC2" href="#TOC1">DESCRIPTION</a><br>
34     <P>
35 nigel 81 The C++ wrapper for PCRE was provided by Google Inc. Some additional
36     functionality was added by Giuseppe Maxia. This brief man page was constructed
37     from the notes in the <i>pcrecpp.h</i> file, which should be consulted for
38     further details.
39 nigel 77 </P>
40     <br><a name="SEC3" href="#TOC1">MATCHING INTERFACE</a><br>
41     <P>
42     The "FullMatch" operation checks that supplied text matches a supplied pattern
43     exactly. If pointer arguments are supplied, it copies matched sub-strings that
44     match sub-patterns into them.
45     <pre>
46     Example: successful match
47     pcrecpp::RE re("h.*o");
48     re.FullMatch("hello");
50     Example: unsuccessful match (requires full match):
51     pcrecpp::RE re("e");
52     !re.FullMatch("hello");
54     Example: creating a temporary RE object:
55     pcrecpp::RE("h.*o").FullMatch("hello");
56     </pre>
57     You can pass in a "const char*" or a "string" for "text". The examples below
58     tend to use a const char*. You can, as in the different examples above, store
59     the RE object explicitly in a variable or use a temporary RE object. The
60     examples below use one mode or the other arbitrarily. Either could correctly be
61     used for any of these examples.
62     </P>
63     <P>
64     You must supply extra pointer arguments to extract matched subpieces.
65     <pre>
66     Example: extracts "ruby" into "s" and 1234 into "i"
67     int i;
68     string s;
69     pcrecpp::RE re("(\\w+):(\\d+)");
70     re.FullMatch("ruby:1234", &s, &i);
72     Example: does not try to extract any extra sub-patterns
73     re.FullMatch("ruby:1234", &s);
75     Example: does not try to extract into NULL
76     re.FullMatch("ruby:1234", NULL, &i);
78     Example: integer overflow causes failure
79     !re.FullMatch("ruby:1234567891234", NULL, &i);
81     Example: fails because there aren't enough sub-patterns:
82     !pcrecpp::RE("\\w+:\\d+").FullMatch("ruby:1234", &s);
84     Example: fails because string cannot be stored in integer
85     !pcrecpp::RE("(.*)").FullMatch("ruby", &i);
86     </pre>
87     The provided pointer arguments can be pointers to any scalar numeric
88     type, or one of:
89     <pre>
90     string (matched piece is copied to string)
91     StringPiece (StringPiece is mutated to point to matched piece)
92     T (where "bool T::ParseFrom(const char*, int)" exists)
93     NULL (the corresponding matched sub-pattern is not copied)
94     </pre>
95     The function returns true iff all of the following conditions are satisfied:
96     <pre>
97     a. "text" matches "pattern" exactly;
99     b. The number of matched sub-patterns is &#62;= number of supplied
100     pointers;
102     c. The "i"th argument has a suitable type for holding the
103     string captured as the "i"th sub-pattern. If you pass in
104 ph10 286 void * NULL for the "i"th argument, or a non-void * NULL
105     of the correct type, or pass fewer arguments than the
106 nigel 77 number of sub-patterns, the "i"th captured sub-pattern is
107     ignored.
108     </pre>
109 nigel 93 CAVEAT: An optional sub-pattern that does not exist in the matched
110     string is assigned the empty string. Therefore, the following will
111     return false (because the empty string is not a valid number):
112     <pre>
113     int number;
114     pcrecpp::RE("[a-z]+(\\d+)?").FullMatch("abc", &number);
115     </pre>
116 nigel 77 The matching interface supports at most 16 arguments per call.
117     If you need more, consider using the more general interface
118     <b>pcrecpp::RE::DoMatch</b>. See <b>pcrecpp.h</b> for the signature for
119     <b>DoMatch</b>.
120     </P>
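<P>
The overflow and empty-string failures described above both come down to converting captured text into a number. The following is a minimal sketch of such a conversion using only the C++ standard library. The helper <i>parse_int_sketch</i> is hypothetical, written for illustration, and is not part of pcrecpp.
</P>

```cpp
#include <cassert>
#include <cctype>
#include <cerrno>
#include <climits>
#include <cstdlib>
#include <string>

// Hypothetical helper (not part of pcrecpp): converts captured text to an int.
// Returns false, leaving *out untouched, when the text is empty, is not
// entirely a base-10 number, or does not fit in an int.
bool parse_int_sketch(const std::string& text, int* out) {
  if (text.empty()) return false;  // e.g. an optional group that did not match
  // strtol would silently skip leading whitespace, so reject it up front.
  if (std::isspace(static_cast<unsigned char>(text[0]))) return false;
  errno = 0;
  char* end = nullptr;
  const long value = std::strtol(text.c_str(), &end, 10);
  if (end != text.c_str() + text.size()) return false;   // e.g. "ruby" or "12abc"
  if (errno == ERANGE) return false;                     // does not fit in a long
  if (value < INT_MIN || value > INT_MAX) return false;  // fits a long, not an int
  *out = static_cast<int>(value);
  return true;
}
```

With this sketch, "1234" converts successfully, while "1234567891234", "ruby", and the empty string are all rejected.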
121 ph10 392 <P>
122     NOTE: Do not use <b>no_arg</b>, which is used internally to mark the end of a
123     list of optional arguments, as a placeholder for missing arguments, as this can
124     lead to segfaults.
125     </P>
126 nigel 93 <br><a name="SEC4" href="#TOC1">QUOTING METACHARACTERS</a><br>
127 nigel 77 <P>
128 nigel 93 You can use the "QuoteMeta" operation to insert backslashes before all
129     potentially meaningful characters in a string. The returned string, used as a
130     regular expression, will exactly match the original string.
131     <pre>
132     Example:
133     string quoted = RE::QuoteMeta(unquoted);
134     </pre>
135     Note that it's legal to escape a character even if it has no special meaning in
136     a regular expression -- so this function does that. (This also makes it
137     identical to the perl function of the same name; see "perldoc -f quotemeta".)
138     For example, "1.5-2.0?" becomes "1\.5\-2\.0\?".
139     </P>
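<P>
The quoting rule can be sketched in a few lines of standard C++. This is an illustration of the behaviour described above, not the pcrecpp implementation. The name <i>quote_meta_sketch</i> is hypothetical, and the decision to copy bytes with the high bit set unchanged (so that UTF-8 sequences survive) is an assumption of the sketch.
</P>

```cpp
#include <cassert>
#include <string>

// Illustrative sketch only: puts a backslash in front of every ASCII
// character other than letters, digits and '_'. Bytes with the high bit set
// are copied unchanged (an assumption of this sketch); NUL bytes get no
// special treatment here.
std::string quote_meta_sketch(const std::string& unquoted) {
  std::string result;
  result.reserve(unquoted.size() * 2);
  for (unsigned char c : unquoted) {
    const bool is_word = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') ||
                         (c >= '0' && c <= '9') || c == '_';
    const bool is_high = (c & 0x80) != 0;
    if (!is_word && !is_high) result += '\\';
    result += static_cast<char>(c);
  }
  return result;
}
```

Applied to "1.5-2.0?", the sketch produces the same "1\.5\-2\.0\?" shown in the example above.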
140     <br><a name="SEC5" href="#TOC1">PARTIAL MATCHES</a><br>
141     <P>
142 nigel 77 You can use the "PartialMatch" operation when you want the pattern
143     to match any substring of the text.
144     <pre>
145     Example: simple search for a string:
146     pcrecpp::RE("ell").PartialMatch("hello");
148     Example: find first number in a string:
149     int number;
150     pcrecpp::RE re("(\\d+)");
151     re.PartialMatch("x*100 + 20", &number);
152     assert(number == 100);
153     </PRE>
154     </P>
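<P>
The difference between matching the whole text and matching any substring can be tried without PCRE installed. The sketch below uses std::regex from the C++ standard library as a stand-in. It is not pcrecpp, its pattern syntax (ECMAScript) differs from PCRE in places, and the function names are hypothetical.
</P>

```cpp
#include <cassert>
#include <regex>
#include <string>

// Stand-in using std::regex (ECMAScript syntax), NOT pcrecpp.
// regex_match requires the whole text to match, like "FullMatch".
bool full_match_sketch(const std::string& pattern, const std::string& text) {
  return std::regex_match(text, std::regex(pattern));
}

// regex_search accepts a match anywhere in the text, like "PartialMatch".
bool partial_match_sketch(const std::string& pattern, const std::string& text) {
  return std::regex_search(text, std::regex(pattern));
}

// Returns the first capture group of the first match, or "" if there is none.
std::string first_capture_sketch(const std::string& pattern,
                                 const std::string& text) {
  std::smatch m;
  if (std::regex_search(text, m, std::regex(pattern)) && m.size() > 1) {
    return m[1].str();
  }
  return std::string();
}
```

Here "e" fails as a full match of "hello" but succeeds as a partial match, and the first capture of "(\\d+)" in "x*100 + 20" is "100".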
155 nigel 93 <br><a name="SEC6" href="#TOC1">UTF-8 AND THE MATCHING INTERFACE</a><br>
156 nigel 77 <P>
157     By default, pattern and text are plain text, one byte per character. The UTF8
158     flag, passed to the constructor, causes both pattern and string to be treated
159     as UTF-8 text, still a byte stream but potentially multiple bytes per
160     character. In practice, the text is likelier to be UTF-8 than the pattern, but
161     the match returned may depend on the UTF8 flag, so always use it when matching
162     UTF-8 text. For example, "." will match one byte normally but with UTF8 set may
163     match all the bytes of a multi-byte character.
164     <pre>
165     Example:
166     pcrecpp::RE_Options options;
167     options.set_utf8();
168     pcrecpp::RE re(utf8_pattern, options);
169     re.FullMatch(utf8_string);
171     Example: using the convenience function UTF8():
172     pcrecpp::RE re(utf8_pattern, pcrecpp::UTF8());
173     re.FullMatch(utf8_string);
174     </pre>
175     NOTE: The UTF8 flag is ignored if pcre was not configured with the
176     <pre>
177     --enable-utf8 flag.
178     </PRE>
179     </P>
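<P>
To see why byte-oriented and UTF-8 matching can give different results, it helps to count characters and bytes separately. The following sketch uses only the standard library and assumes well-formed UTF-8 input. Both function names are hypothetical.
</P>

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Length in bytes of the UTF-8 sequence introduced by `lead`, judged only by
// the bit pattern of the lead byte (no check for overlong or out-of-range
// encodings). Returns 0 for a continuation byte or an unrecognised pattern.
std::size_t utf8_sequence_length(unsigned char lead) {
  if (lead < 0x80) return 1;            // 0xxxxxxx: ASCII
  if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx
  if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx
  if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx
  return 0;
}

// Counts characters rather than bytes by skipping continuation bytes
// (10xxxxxx). Assumes the input is well-formed UTF-8.
std::size_t utf8_character_count(const std::string& text) {
  std::size_t count = 0;
  for (unsigned char c : text) {
    if ((c & 0xC0) != 0x80) ++count;
  }
  return count;
}
```

A string such as "h\xC3\xA9llo" is six bytes long but contains five characters, so a pattern like "....." can only match it in full when the text is treated as UTF-8.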
180 nigel 93 <br><a name="SEC7" href="#TOC1">PASSING MODIFIERS TO THE REGULAR EXPRESSION ENGINE</a><br>
181 nigel 77 <P>
182 nigel 81 PCRE defines some modifiers to change the behavior of the regular expression
183     engine. The C++ wrapper defines an auxiliary class, RE_Options, as a vehicle to
184     pass such modifiers to a RE class. Currently, the following modifiers are
185     supported:
186     <pre>
187        modifier              description               Perl corresponding
189        PCRE_CASELESS         case insensitive match      /i
190        PCRE_MULTILINE        multiple lines match        /m
191        PCRE_DOTALL           dot matches newlines        /s
192        PCRE_DOLLAR_ENDONLY   $ matches only at end       N/A
193        PCRE_EXTRA            strict escape parsing       N/A
194        PCRE_EXTENDED         ignore whitespaces          /x
195        PCRE_UTF8             handles UTF8 chars          built-in
196        PCRE_UNGREEDY         reverses * and *?           N/A
197        PCRE_NO_AUTO_CAPTURE  disables capturing parens   N/A (*)
198     </pre>
199     (*) Both Perl and PCRE allow non-capturing parentheses by means of the
200     "?:" modifier within the pattern itself. For example, (?:ab|cd) does not
201     capture, while (ab|cd) does.
202     </P>
203     <P>
204     For a full account of how each modifier works, please check the
205     PCRE API reference page.
206     </P>
207     <P>
208     For each modifier, there are two member functions whose names are derived
209     from the modifier in lowercase, without the "PCRE_" prefix. For
210     instance, PCRE_CASELESS is handled by
211     <pre>
212     bool caseless()
213     </pre>
214     which returns true if the modifier is set, and
215     <pre>
216     RE_Options & set_caseless(bool)
217     </pre>
218 nigel 87 which sets or unsets the modifier. Moreover, PCRE_EXTRA_MATCH_LIMIT can be
219 nigel 81 accessed through the <b>set_match_limit()</b> and <b>match_limit()</b> member
220     functions. Setting <i>match_limit</i> to a non-zero value will limit the
221     execution of pcre to keep it from doing bad things like blowing the stack or
222     taking an eternity to return a result. A value of 5000 is good enough to stop
223     stack blowup in a 2MB thread stack. Setting <i>match_limit</i> to zero disables
224 nigel 87 match limiting. Alternatively, you can call <b>set_match_limit_recursion()</b>
225     which uses PCRE_EXTRA_MATCH_LIMIT_RECURSION to limit how much PCRE
226     recurses. <b>match_limit()</b> limits the number of matches PCRE does;
227     <b>match_limit_recursion()</b> limits the depth of internal recursion, and
228     therefore the amount of stack that is used.
229 nigel 81 </P>
230     <P>
231     Normally, to pass one or more modifiers to a RE class, you declare
232     a <i>RE_Options</i> object, set the appropriate options, and pass this
233     object to a RE constructor. Example:
234     <pre>
235 ph10 654 RE_Options opt;
236 nigel 81 opt.set_caseless(true);
237     if (RE("HELLO", opt).PartialMatch("hello world")) ...
238     </pre>
239     RE_Options has two constructors. The default constructor takes no arguments and
240     creates a set of flags that are all off. The other constructor takes an
241     <i>option_flags</i> argument, which facilitates transfer of legacy code from C programs.
242     This lets you do
243     <pre>
244     RE(pattern,
245     RE_Options(PCRE_CASELESS|PCRE_MULTILINE)).PartialMatch(str);
246     </pre>
247     However, new code is better off doing
248     <pre>
249     RE(pattern,
250     RE_Options().set_caseless(true).set_multiline(true))
251     .PartialMatch(str);
252     </pre>
253     If you are going to pass one of the most used modifiers, there are some
254     convenience functions that return a RE_Options object with the
255     appropriate modifier already set: <b>CASELESS()</b>, <b>UTF8()</b>,
256     <b>MULTILINE()</b>, <b>DOTALL()</b>, and <b>EXTENDED()</b>.
257     </P>
258     <P>
259     If you need to set several options at once, and you don't want to go through
260     the pains of declaring a RE_Options object and setting several options, there
261     is a parallel method that gives you this ability on the fly. You can chain
262     several <b>set_xxxxx()</b> member functions, since each of them returns a
263     reference to its class object. For example, to pass PCRE_CASELESS,
264     PCRE_EXTENDED, and PCRE_MULTILINE to a RE with one statement, you may write:
265     <pre>
266     RE(" ^ xyz \\s+ .* blah$",
267     RE_Options()
268     .set_caseless(true)
269     .set_extended(true)
270     .set_multiline(true)).PartialMatch(sometext);
272     </PRE>
273     </P>
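<P>
Chaining works because every setter returns a reference to the object it was called on. The class below is a hypothetical stand-in for RE_Options, written only to show that pattern. Its name, its <i>all_options()</i> accessor, and its flag values are assumptions of the sketch and are not the PCRE_* constants.
</P>

```cpp
#include <cassert>

// Hypothetical stand-in for RE_Options, for illustration only.
// The flag values are placeholders, not the real PCRE_* constants.
class OptionsSketch {
 public:
  enum Flag { kCaseless = 1 << 0, kMultiline = 1 << 1, kExtended = 1 << 2 };

  OptionsSketch() : flags_(0) {}                                   // all flags off
  explicit OptionsSketch(int option_flags) : flags_(option_flags) {}

  bool caseless() const { return (flags_ & kCaseless) != 0; }
  bool multiline() const { return (flags_ & kMultiline) != 0; }
  bool extended() const { return (flags_ & kExtended) != 0; }
  int all_options() const { return flags_; }

  // Each setter returns *this so that calls can be chained.
  OptionsSketch& set_caseless(bool on) { return set_flag(kCaseless, on); }
  OptionsSketch& set_multiline(bool on) { return set_flag(kMultiline, on); }
  OptionsSketch& set_extended(bool on) { return set_flag(kExtended, on); }

 private:
  OptionsSketch& set_flag(int bit, bool on) {
    if (on) {
      flags_ |= bit;
    } else {
      flags_ &= ~bit;
    }
    return *this;
  }
  int flags_;
};
```

An expression such as OptionsSketch().set_caseless(true).set_extended(true) builds a fully configured temporary in a single statement, which is the same shape as the RE_Options example above.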
274 nigel 93 <br><a name="SEC8" href="#TOC1">SCANNING TEXT INCREMENTALLY</a><br>
275 nigel 81 <P>
276 nigel 77 The "Consume" operation may be useful if you want to repeatedly
277     match regular expressions at the front of a string and skip over
278     them as they match. This requires use of the "StringPiece" type,
279     which represents a sub-range of a real string. Like RE, StringPiece
280     is defined in the pcrecpp namespace.
281     <pre>
282     Example: read lines of the form "var = value" from a string.
283     string contents = ...; // Fill string somehow
284     pcrecpp::StringPiece input(contents); // Wrap in a StringPiece
285 ph10 583
286 nigel 77 string var;
287     int value;
288     pcrecpp::RE re("(\\w+) = (\\d+)\n");
289     while (re.Consume(&input, &var, &value)) {
290     ...;
291     }
292     </pre>
293     Each successful call to "Consume" will set "var/value", and also
294     advance "input" so it points past the matched text.
295     </P>
296     <P>
297     The "FindAndConsume" operation is similar to "Consume" but does not
298     anchor your match at the beginning of the string. For example, you
299     could extract all words from a string by repeatedly calling
300     <pre>
301     pcrecpp::RE("(\\w+)").FindAndConsume(&input, &word)
302     </PRE>
303     </P>
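<P>
The difference between anchored and unanchored consumption can be sketched with std::regex standing in for pcrecpp. Everything in this block is an assumption made for illustration: <i>PieceSketch</i> is a hypothetical substitute for StringPiece, the function names are invented, and only the first capture group is extracted.
</P>

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <vector>

// Hypothetical substitute for StringPiece: a [begin, end) range over text
// owned by the caller.
struct PieceSketch {
  const char* begin;
  const char* end;
};

// Shared helper: searches `input`, stores capture 1 (if any) in *out, and
// advances input->begin past the matched text.
// A zero-length match would not advance the input; callers must avoid
// patterns that can match the empty string.
static bool consume_impl(const std::regex& re, PieceSketch* input,
                         std::string* out,
                         std::regex_constants::match_flag_type flags) {
  std::cmatch m;
  if (!std::regex_search(input->begin, input->end, m, re, flags)) return false;
  if (m.size() > 1) *out = m[1].str();
  input->begin = m[0].second;
  return true;
}

// Like "Consume": the match must start at the front of the input.
bool consume_sketch(const std::regex& re, PieceSketch* input, std::string* out) {
  return consume_impl(re, input, out, std::regex_constants::match_continuous);
}

// Like "FindAndConsume": the match may start anywhere in the input.
bool find_and_consume_sketch(const std::regex& re, PieceSketch* input,
                             std::string* out) {
  return consume_impl(re, input, out, std::regex_constants::match_default);
}

// Extracts every word by calling find_and_consume_sketch repeatedly.
std::vector<std::string> all_words_sketch(const std::string& text) {
  const std::regex word("(\\w+)");
  PieceSketch input = {text.data(), text.data() + text.size()};
  std::vector<std::string> words;
  std::string w;
  while (find_and_consume_sketch(word, &input, &w)) words.push_back(w);
  return words;
}
```

With the input " x", the anchored version fails because the text starts with a space, while the unanchored version skips the space, extracts "x", and leaves the input empty.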
304 nigel 93 <br><a name="SEC9" href="#TOC1">PARSING HEX/OCTAL/C-RADIX NUMBERS</a><br>
305 nigel 77 <P>
306     By default, if you pass a pointer to a numeric value, the
307     corresponding text is interpreted as a base-10 number. You can
308     instead wrap the pointer with a call to one of the operators Hex(),
309     Octal(), or CRadix() to interpret the text in another base. The
310     CRadix operator interprets C-style "0" (base-8) and "0x" (base-16)
311     prefixes, but defaults to base-10.
312     <pre>
313     Example:
314     int a, b, c, d;
315     pcrecpp::RE re("(.*) (.*) (.*) (.*)");
316     re.FullMatch("100 40 0100 0x40",
317     pcrecpp::Octal(&a), pcrecpp::Hex(&b),
318     pcrecpp::CRadix(&c), pcrecpp::CRadix(&d));
319     </pre>
320     will leave 64 in a, b, c, and d.
321     </P>
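<P>
The arithmetic in this example can be checked with std::strtol, whose base argument behaves much like the three operators: 8 for octal, 16 for hexadecimal, and 0 for C-style prefix detection. The helper names below are hypothetical, and the sketch makes no claim about how pcrecpp implements Hex(), Octal(), or CRadix().
</P>

```cpp
#include <cassert>
#include <cstdlib>

// Illustrative helpers built on std::strtol, not the pcrecpp operators.
long parse_octal_sketch(const char* text) {
  return std::strtol(text, nullptr, 8);
}

long parse_hex_sketch(const char* text) {
  return std::strtol(text, nullptr, 16);
}

// Base 0 makes strtol honour a leading "0" (base 8) or "0x" (base 16)
// and fall back to base 10 otherwise.
long parse_c_radix_sketch(const char* text) {
  return std::strtol(text, nullptr, 0);
}
```

Octal "100", hexadecimal "40", and the C-style literals "0100" and "0x40" all evaluate to 64, matching the result stated above.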
322 nigel 93 <br><a name="SEC10" href="#TOC1">REPLACING PARTS OF STRINGS</a><br>
323 nigel 77 <P>
324     You can replace the first match of "pattern" in "str" with "rewrite".
325     Within "rewrite", backslash-escaped digits (\1 to \9) can be
326     used to insert the text matching the corresponding parenthesized group
327     from the pattern. \0 in "rewrite" refers to the entire matching
328     text. For example:
329     <pre>
330     string s = "yabba dabba doo";
331     pcrecpp::RE("b+").Replace("d", &s);
332     </pre>
333     will leave "s" containing "yada dabba doo". The result is true if the pattern
334     matches and a replacement occurs, false otherwise.
335     </P>
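<P>
The substitution of \0 to \9 inside "rewrite" can be sketched as a small standalone function. This is a hypothetical helper, not the pcrecpp implementation. It assumes the matched groups are already available in a vector, with the whole match at index 0, and it treats "\\" as a literal backslash and any other backslash sequence as an error.
</P>

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: expands "\0".."\9" in `rewrite` using `groups`, where
// groups[0] is the whole match and groups[n] is the n-th parenthesized group.
// "\\" yields one backslash. Returns false, leaving *out untouched, for a
// trailing backslash, any other escape, or a reference to a missing group.
bool expand_rewrite_sketch(const std::string& rewrite,
                           const std::vector<std::string>& groups,
                           std::string* out) {
  std::string result;
  for (std::size_t i = 0; i < rewrite.size(); ++i) {
    const char c = rewrite[i];
    if (c != '\\') {
      result += c;
      continue;
    }
    if (++i == rewrite.size()) return false;  // trailing backslash
    const char next = rewrite[i];
    if (next == '\\') {
      result += '\\';
    } else if (next >= '0' && next <= '9') {
      const std::size_t n = static_cast<std::size_t>(next - '0');
      if (n >= groups.size()) return false;   // no such group
      result += groups[n];
    } else {
      return false;                           // unsupported escape
    }
  }
  *out = result;
  return true;
}
```

For a match of "ruby:1234" against "(\\w+):(\\d+)", the groups are "ruby:1234", "ruby", and "1234", so the rewrite "\2=\1 (\0)" expands to "1234=ruby (ruby:1234)".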
336     <P>
337     <b>GlobalReplace</b> is like <b>Replace</b> except that it replaces all
338     occurrences of the pattern in the string with the rewrite. Replacements are
339     not subject to re-matching. For example:
340     <pre>
341     string s = "yabba dabba doo";
342     pcrecpp::RE("b+").GlobalReplace("d", &s);
343     </pre>
344     will leave "s" containing "yada dada doo". It returns the number of
345     replacements made.
346     </P>
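<P>
The first-match and all-matches behaviours can be compared side by side using std::regex_replace as a stand-in. This is not pcrecpp: the function names are hypothetical, and std::regex_replace expects "$1"-style references in the replacement text rather than "\1".
</P>

```cpp
#include <cassert>
#include <regex>
#include <string>

// Stand-in using std::regex, NOT pcrecpp.
// Replaces only the first match, like "Replace".
std::string replace_first_sketch(const std::string& text,
                                 const std::string& pattern,
                                 const std::string& replacement) {
  return std::regex_replace(text, std::regex(pattern), replacement,
                            std::regex_constants::format_first_only);
}

// Replaces every non-overlapping match, like "GlobalReplace".
std::string replace_all_sketch(const std::string& text,
                               const std::string& pattern,
                               const std::string& replacement) {
  return std::regex_replace(text, std::regex(pattern), replacement);
}
```

Unlike Replace and GlobalReplace, these sketches return a new string instead of modifying the input in place, and they do not report whether or how many replacements were made.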
347     <P>
348     <b>Extract</b> is like <b>Replace</b>, except that if the pattern matches,
349     "rewrite" is copied into "out" (an additional argument) with substitutions.
350     The non-matching portions of "text" are ignored. Returns true iff a match
351     occurred and the extraction happened successfully; if no match occurs, the
352     string is left unaffected.
353     </P>
354 nigel 93 <br><a name="SEC11" href="#TOC1">AUTHOR</a><br>
355 nigel 77 <P>
356     The C++ wrapper was contributed by Google Inc.
357     <br>
358 ph10 123 Copyright &copy; 2007 Google Inc.
359 ph10 99 <br>
360     </P>
361     <br><a name="SEC12" href="#TOC1">REVISION</a><br>
362     <P>
363 ph10 392 Last updated: 17 March 2009
364 ph10 99 <br>
365 ph10 654 Minor typo fixed: 25 July 2011
366     <br>
367 nigel 77 <p>
368     Return to the <a href="index.html">PCRE index page</a>.
369     </p>


Name Value
svn:eol-style native
svn:keywords "Author Date Id Revision Url"
