<html>
<head>
<title>pcrecpp specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcrecpp man page</h1>
<p>
Return to the <a href="index.html">PCRE index page</a>.
</p>
<p>
This page is part of the PCRE HTML documentation. It was generated automatically
from the original man page. If there is any nonsense in it, please consult the
man page, in case the conversion went wrong.
<br>
<ul>
<li><a name="TOC1" href="#SEC1">SYNOPSIS OF C++ WRAPPER</a>
<li><a name="TOC2" href="#SEC2">DESCRIPTION</a>
<li><a name="TOC3" href="#SEC3">MATCHING INTERFACE</a>
<li><a name="TOC4" href="#SEC4">QUOTING METACHARACTERS</a>
<li><a name="TOC5" href="#SEC5">PARTIAL MATCHES</a>
<li><a name="TOC6" href="#SEC6">UTF-8 AND THE MATCHING INTERFACE</a>
<li><a name="TOC7" href="#SEC7">PASSING MODIFIERS TO THE REGULAR EXPRESSION ENGINE</a>
<li><a name="TOC8" href="#SEC8">SCANNING TEXT INCREMENTALLY</a>
<li><a name="TOC9" href="#SEC9">PARSING HEX/OCTAL/C-RADIX NUMBERS</a>
<li><a name="TOC10" href="#SEC10">REPLACING PARTS OF STRINGS</a>
<li><a name="TOC11" href="#SEC11">AUTHOR</a>
<li><a name="TOC12" href="#SEC12">REVISION</a>
</ul>
<br><a name="SEC1" href="#TOC1">SYNOPSIS OF C++ WRAPPER</a><br>
<P>
<b>#include &lt;pcrecpp.h&gt;</b>
</P>
<br><a name="SEC2" href="#TOC1">DESCRIPTION</a><br>
<P>
The C++ wrapper for PCRE was provided by Google Inc. Some additional
functionality was added by Giuseppe Maxia. This brief man page was constructed
from the notes in the <i>pcrecpp.h</i> file, which should be consulted for
further details.
</P>
<br><a name="SEC3" href="#TOC1">MATCHING INTERFACE</a><br>
<P>
The "FullMatch" operation checks that supplied text matches a supplied pattern
exactly. If pointer arguments are supplied, it copies the sub-strings matched
by sub-patterns into them.
<pre>
  Example: successful match
     pcrecpp::RE re("h.*o");
     re.FullMatch("hello");

  Example: unsuccessful match (requires full match):
     pcrecpp::RE re("e");
     !re.FullMatch("hello");

  Example: creating a temporary RE object:
     pcrecpp::RE("h.*o").FullMatch("hello");
</pre>
You can pass in a "const char*" or a "string" for "text". The examples below
tend to use a const char*. You can, as in the different examples above, store
the RE object explicitly in a variable or use a temporary RE object. The
examples below use one mode or the other arbitrarily. Either could correctly be
used for any of these examples.
</P>
<P>
You must supply extra pointer arguments to extract matched subpieces.
<pre>
  Example: extracts "ruby" into "s" and 1234 into "i"
     int i;
     string s;
     pcrecpp::RE re("(\\w+):(\\d+)");
     re.FullMatch("ruby:1234", &s, &i);

  Example: does not try to extract any extra sub-patterns
     re.FullMatch("ruby:1234", &s);

  Example: does not try to extract into NULL
     re.FullMatch("ruby:1234", NULL, &i);

  Example: integer overflow causes failure
     !re.FullMatch("ruby:1234567891234", NULL, &i);

  Example: fails because there aren't enough sub-patterns:
     !pcrecpp::RE("\\w+:\\d+").FullMatch("ruby:1234", &s);

  Example: fails because string cannot be stored in integer
     !pcrecpp::RE("(.*)").FullMatch("ruby", &i);
</pre>
The provided pointer arguments can be pointers to any scalar numeric
type, or one of:
<pre>
  string        (matched piece is copied to string)
  StringPiece   (StringPiece is mutated to point to matched piece)
  T             (where "bool T::ParseFrom(const char*, int)" exists)
  NULL          (the corresponding matched sub-pattern is not copied)
</pre>
The function returns true iff all of the following conditions are satisfied:
<pre>
  a. "text" matches "pattern" exactly;

  b. The number of matched sub-patterns is >= number of supplied
     pointers;

  c. The "i"th argument has a suitable type for holding the
     string captured as the "i"th sub-pattern. If you pass in
     void * NULL for the "i"th argument, or a non-void * NULL
     of the correct type, or pass fewer arguments than the
     number of sub-patterns, the "i"th captured sub-pattern is
     ignored.
</pre>
CAVEAT: An optional sub-pattern that does not exist in the matched
string is assigned the empty string. Therefore, the following will
return false (because the empty string is not a valid number):
<pre>
  int number;
  pcrecpp::RE("[a-z]+(\\d+)?").FullMatch("abc", &number);
</pre>
The matching interface supports at most 16 arguments per call.
If you need more, consider using the more general interface
<b>pcrecpp::RE::DoMatch</b>. See <b>pcrecpp.h</b> for the signature of
<b>DoMatch</b>.
</P>
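<P>
The numeric failure cases described above (overflow, non-numeric text, and an
empty optional capture) can be pictured with a small conversion routine. This
is only a sketch under assumptions: <b>parse_int</b> is a hypothetical helper
written for illustration, not a pcrecpp function, and the wrapper's real
conversion code may differ in detail.
</P>

```cpp
#include <cerrno>
#include <climits>
#include <cstdlib>
#include <string>

// Hypothetical helper (not part of pcrecpp): converts a captured sub-string
// to int and reports failure instead of silently truncating.
bool parse_int(const std::string& text, int* out) {
  if (text.empty()) return false;       // empty capture is not a number
  errno = 0;
  char* end = nullptr;
  const long value = std::strtol(text.c_str(), &end, 10);
  if (*end != '\0') return false;       // leftover characters, e.g. "ruby"
  if (errno == ERANGE) return false;    // does not even fit in a long
  if (value < INT_MIN || value > INT_MAX) return false;  // does not fit in int
  *out = static_cast<int>(value);
  return true;
}
```

<P>
Under these assumptions "1234" converts, while "1234567891234", "ruby", and the
empty string are all rejected, which matches the outcomes listed in the
examples for this interface.
</P>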
<P>
NOTE: <b>no_arg</b> is used internally to mark the end of a list of optional
arguments. Do not use it as a placeholder for a missing argument, as this can
lead to segfaults.
</P>
<br><a name="SEC4" href="#TOC1">QUOTING METACHARACTERS</a><br>
<P>
You can use the "QuoteMeta" operation to insert backslashes before all
potentially meaningful characters in a string. The returned string, used as a
regular expression, will exactly match the original string.
<pre>
  Example:
     string quoted = RE::QuoteMeta(unquoted);
</pre>
Note that it's legal to escape a character even if it has no special meaning in
a regular expression -- so this function does that. (This also makes it
identical to the Perl function of the same name; see "perldoc -f quotemeta".)
For example, "1.5-2.0?" becomes "1\.5\-2\.0\?".
</P>
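<P>
To make the quoting rule concrete, here is a minimal sketch of one way such a
function could work. The name <b>quote_meta_sketch</b> and the exact rule
(escape every byte except ASCII letters, digits, underscore, and bytes with the
high bit set) are assumptions made for illustration; in real code, call
<b>RE::QuoteMeta</b> instead. NUL bytes are not given any special treatment
here.
</P>

```cpp
#include <string>

// Hypothetical illustration of a quotemeta-style function; not the
// pcrecpp implementation.
std::string quote_meta_sketch(const std::string& unquoted) {
  std::string quoted;
  quoted.reserve(unquoted.size() * 2);
  for (unsigned char c : unquoted) {
    const bool is_word = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') ||
                         (c >= '0' && c <= '9') || c == '_';
    const bool is_high_bit = (c & 0x80) != 0;  // byte of a multi-byte sequence
    if (!is_word && !is_high_bit) quoted += '\\';  // escape everything else
    quoted += static_cast<char>(c);
  }
  return quoted;
}
```

<P>
With this rule, "1.5-2.0?" becomes "1\.5\-2\.0\?", as in the example above,
and a string such as "abc_123" is returned unchanged.
</P>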
<br><a name="SEC5" href="#TOC1">PARTIAL MATCHES</a><br>
<P>
You can use the "PartialMatch" operation when you want the pattern
to match any substring of the text.
<pre>
  Example: simple search for a string:
     pcrecpp::RE("ell").PartialMatch("hello");

  Example: find first number in a string:
     int number;
     pcrecpp::RE re("(\\d+)");
     re.PartialMatch("x*100 + 20", &number);
     assert(number == 100);
</pre>
</P>
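<P>
The difference between matching the whole text and matching any substring can
also be tried without PCRE installed. The sketch below uses the C++ standard
library's <b>std::regex_match</b> and <b>std::regex_search</b> as stand-ins;
this is an analogy only. <b>std::regex</b> uses ECMAScript syntax rather than
PCRE syntax, and the helper names are hypothetical.
</P>

```cpp
#include <regex>
#include <string>

// Whole-text match: comparable in spirit to FullMatch.
bool full_match(const std::string& text, const std::string& pattern) {
  return std::regex_match(text, std::regex(pattern));
}

// Match anywhere in the text: comparable in spirit to PartialMatch.
bool partial_match(const std::string& text, const std::string& pattern) {
  return std::regex_search(text, std::regex(pattern));
}

// Finds the first run of digits in `text` and stores it in `*out`.
// Note: std::stoi throws on overflow instead of returning false.
bool first_number(const std::string& text, int* out) {
  static const std::regex digits("(\\d+)");
  std::smatch m;
  if (!std::regex_search(text, m, digits)) return false;
  *out = std::stoi(m[1].str());
  return true;
}
```

<P>
Here the pattern "e" fails against "hello" as a whole-text match but succeeds
as a substring match, and <b>first_number</b> applied to "x*100 + 20" yields
100.
</P>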
<br><a name="SEC6" href="#TOC1">UTF-8 AND THE MATCHING INTERFACE</a><br>
<P>
By default, pattern and text are plain text, one byte per character. The UTF8
flag, passed to the constructor, causes both pattern and string to be treated
as UTF-8 text, still a byte stream but potentially multiple bytes per
character. In practice, the text is likelier to be UTF-8 than the pattern, but
the match returned may depend on the UTF8 flag, so always use it when matching
UTF-8 text. For example, "." will match one byte normally but with UTF8 set may
match all the bytes of a multi-byte character (up to four in standard UTF-8).
<pre>
  Example:
     pcrecpp::RE_Options options;
     options.set_utf8(true);
     pcrecpp::RE re(utf8_pattern, options);
     re.FullMatch(utf8_string);

  Example: using the convenience function UTF8():
     pcrecpp::RE re(utf8_pattern, pcrecpp::UTF8());
     re.FullMatch(utf8_string);
</pre>
NOTE: The UTF8 flag is ignored if PCRE was not configured with the
<b>--enable-utf8</b> flag.
</P>
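<P>
The byte-versus-character distinction is easy to check by hand. The following
sketch assumes well-formed UTF-8 input and uses two hypothetical helpers (they
are not part of pcrecpp) to show how many bytes a character occupies and how
the character count of a string can differ from its byte length.
</P>

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: number of bytes in the UTF-8 sequence that starts
// with `lead`, or 0 if `lead` is a continuation byte (10xxxxxx) or has no
// valid lead-byte pattern. Overlong lead bytes are not rejected here.
std::size_t utf8_sequence_length(unsigned char lead) {
  if (lead < 0x80) return 1;            // 0xxxxxxx: ASCII
  if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx
  if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx
  if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx
  return 0;
}

// Hypothetical helper: counts characters in well-formed UTF-8 by counting
// every byte that is not a continuation byte.
std::size_t utf8_char_count(const std::string& s) {
  std::size_t count = 0;
  for (unsigned char c : s) {
    if ((c & 0xC0) != 0x80) ++count;
  }
  return count;
}
```

<P>
For instance, the string "h\xC3\xA9llo" is six bytes long but holds five
characters, so a pattern such as "h.llo" can only match it as a whole when the
second character is treated as one two-byte unit.
</P>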
<br><a name="SEC7" href="#TOC1">PASSING MODIFIERS TO THE REGULAR EXPRESSION ENGINE</a><br>
<P>
PCRE defines some modifiers to change the behavior of the regular expression
engine. The C++ wrapper defines an auxiliary class, RE_Options, as a vehicle to
pass such modifiers to a RE class. Currently, the following modifiers are
supported:
<pre>
  modifier              description                 Perl equivalent

  PCRE_CASELESS         case insensitive match      /i
  PCRE_MULTILINE        multiple lines match        /m
  PCRE_DOTALL           dot matches newlines        /s
  PCRE_DOLLAR_ENDONLY   $ matches only at end       N/A
  PCRE_EXTRA            strict escape parsing       N/A
  PCRE_EXTENDED         ignore whitespace           /x
  PCRE_UTF8             handles UTF8 chars          built-in
  PCRE_UNGREEDY         reverses * and *?           N/A
  PCRE_NO_AUTO_CAPTURE  disables capturing parens   N/A (*)
</pre>
(*) Both Perl and PCRE allow non-capturing parentheses by means of the
"?:" modifier within the pattern itself, e.g. (?:ab|cd) does not
capture, while (ab|cd) does.
</P>
<P>
For a full account of how each modifier works, please check the
PCRE API reference page.
</P>
<P>
For each modifier, there are two member functions whose names are formed
from the modifier in lowercase, without the "PCRE_" prefix. For
instance, PCRE_CASELESS is handled by
<pre>
  bool caseless()
</pre>
which returns true if the modifier is set, and
<pre>
  RE_Options & set_caseless(bool)
</pre>
which sets or unsets the modifier. Moreover, PCRE_EXTRA_MATCH_LIMIT can be
accessed through the <b>set_match_limit()</b> and <b>match_limit()</b> member
functions. Setting <i>match_limit</i> to a non-zero value will limit the
execution of PCRE to keep it from doing bad things like blowing the stack or
taking an eternity to return a result. A value of 5000 is good enough to stop
stack blowup in a 2MB thread stack. Setting <i>match_limit</i> to zero disables
match limiting. Alternatively, you can call <b>set_match_limit_recursion()</b>,
which uses PCRE_EXTRA_MATCH_LIMIT_RECURSION to limit how much PCRE
recurses. <b>match_limit()</b> limits the total number of internal matching
calls PCRE makes; <b>match_limit_recursion()</b> limits the depth of internal
recursion, and therefore the amount of stack that is used.
</P>
<P>
Normally, to pass one or more modifiers to a RE class, you declare
a <i>RE_Options</i> object, set the appropriate options, and pass this
object to a RE constructor. Example:
<pre>
  RE_Options opt;
  opt.set_caseless(true);
  if (RE("HELLO", opt).PartialMatch("hello world")) ...
</pre>
RE_Options has two constructors. The default constructor takes no arguments and
creates a set of flags that are off by default. The second takes an
<i>option_flags</i> argument, which eases the porting of legacy code from C
programs. This lets you do
<pre>
  RE(pattern,
    RE_Options(PCRE_CASELESS|PCRE_MULTILINE)).PartialMatch(str);
</pre>
However, new code is better off doing
<pre>
  RE(pattern,
    RE_Options().set_caseless(true).set_multiline(true))
      .PartialMatch(str);
</pre>
If you are going to pass one of the most used modifiers, there are some
convenience functions that return a RE_Options object with the
appropriate modifier already set: <b>CASELESS()</b>, <b>UTF8()</b>,
<b>MULTILINE()</b>, <b>DOTALL()</b>, and <b>EXTENDED()</b>.
</P>
<P>
If you need to set several options at once, and you don't want to go through
the pains of declaring a RE_Options object and setting several options, there
is a parallel method that gives you this ability on the fly. You can chain
several <b>set_xxxxx()</b> member functions, since each of them returns a
reference to its class object. For example, to pass PCRE_CASELESS,
PCRE_EXTENDED, and PCRE_MULTILINE to a RE with one statement, you may write:
<pre>
  RE(" ^ xyz \\s+ .* blah$",
    RE_Options()
      .set_caseless(true)
      .set_extended(true)
      .set_multiline(true)).PartialMatch(sometext);
</pre>
</P>
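<P>
The chaining described above works because every setter returns a reference to
the object it was called on. The miniature class below is a sketch of that
pattern only. <b>Options</b>, its flag constants, and <b>all_options()</b> are
hypothetical names chosen for illustration, and the flag values are not the
real PCRE_* values from <i>pcre.h</i>.
</P>

```cpp
// Hypothetical flag values for illustration only.
enum : int {
  kCaseless  = 1 << 0,
  kMultiline = 1 << 1,
  kExtended  = 1 << 2
};

// Hypothetical miniature of an options class with chainable setters.
class Options {
 public:
  Options() : flags_(0) {}                                  // all flags off
  explicit Options(int option_flags) : flags_(option_flags) {}

  bool caseless() const  { return (flags_ & kCaseless) != 0; }
  bool multiline() const { return (flags_ & kMultiline) != 0; }
  bool extended() const  { return (flags_ & kExtended) != 0; }

  // Each setter returns *this so that calls can be chained.
  Options& set_caseless(bool on)  { return set_flag(on, kCaseless); }
  Options& set_multiline(bool on) { return set_flag(on, kMultiline); }
  Options& set_extended(bool on)  { return set_flag(on, kExtended); }

  int all_options() const { return flags_; }

 private:
  Options& set_flag(bool on, int flag) {
    if (on) flags_ |= flag; else flags_ &= ~flag;
    return *this;
  }
  int flags_;
};
```

<P>
In this sketch, Options().set_caseless(true).set_multiline(true) ends up with
the same flags as Options(kCaseless | kMultiline), which mirrors the two
construction styles shown for RE_Options.
</P>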
<br><a name="SEC8" href="#TOC1">SCANNING TEXT INCREMENTALLY</a><br>
<P>
The "Consume" operation may be useful if you want to repeatedly
match regular expressions at the front of a string and skip over
them as they match. This requires use of the "StringPiece" type,
which represents a sub-range of a real string. Like RE, StringPiece
is defined in the pcrecpp namespace.
<pre>
  Example: read lines of the form "var = value" from a string.
     string contents = ...;                 // Fill string somehow
     pcrecpp::StringPiece input(contents);  // Wrap in a StringPiece

     string var;
     int value;
     pcrecpp::RE re("(\\w+) = (\\d+)\n");
     while (re.Consume(&input, &var, &value)) {
       ...;
     }
</pre>
Each successful call to "Consume" will set "var" and "value", and also
advance "input" so it points past the matched text.
</P>
<P>
The "FindAndConsume" operation is similar to "Consume" but does not
anchor your match at the beginning of the string. For example, you
could extract all words from a string by repeatedly calling
<pre>
  pcrecpp::RE("(\\w+)").FindAndConsume(&input, &word)
</pre>
</P>
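<P>
The consume-and-advance loop can be sketched with the standard library alone.
This is an analogy under assumptions, not pcrecpp code: it uses
<b>std::regex_search</b> with the <b>match_continuous</b> flag to anchor each
match at the current position, an iterator in place of a StringPiece, and a
hypothetical helper name. Dropping the flag would give behaviour closer to
"FindAndConsume".
</P>

```cpp
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: reads consecutive "var = value\n" lines from the
// front of `contents` and stops at the first position that does not match.
std::vector<std::pair<std::string, int> > parse_assignments(
    const std::string& contents) {
  static const std::regex line("(\\w+) = (\\d+)\n");
  std::vector<std::pair<std::string, int> > result;
  std::string::const_iterator pos = contents.begin();
  std::smatch m;
  // match_continuous requires the match to begin exactly at `pos`.
  while (std::regex_search(pos, contents.end(), m, line,
                           std::regex_constants::match_continuous)) {
    // std::stoi throws if the digits do not fit in an int.
    result.emplace_back(m[1].str(), std::stoi(m[2].str()));
    pos = m[0].second;  // advance past the matched text
  }
  return result;
}
```

<P>
Given "a = 1\nbb = 22\nrest", the loop collects two pairs and stops at "rest";
given input whose first line does not have the expected form, it collects
nothing, because the match is anchored at the front.
</P>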
<br><a name="SEC9" href="#TOC1">PARSING HEX/OCTAL/C-RADIX NUMBERS</a><br>
<P>
By default, if you pass a pointer to a numeric value, the
corresponding text is interpreted as a base-10 number. You can
instead wrap the pointer with a call to one of the operators Hex(),
Octal(), or CRadix() to interpret the text in another base. The
CRadix operator interprets C-style "0" (base-8) and "0x" (base-16)
prefixes, but defaults to base-10.
<pre>
  Example:
     int a, b, c, d;
     pcrecpp::RE re("(.*) (.*) (.*) (.*)");
     re.FullMatch("100 40 0100 0x40",
                  pcrecpp::Octal(&a), pcrecpp::Hex(&b),
                  pcrecpp::CRadix(&c), pcrecpp::CRadix(&d));
</pre>
will leave 64 in a, b, c, and d.
</P>
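<P>
The same radix rules can be checked with the C library function
<b>strtol</b>, whose base 0 mode applies the C-style "0" and "0x" prefix
conventions. The helper below is a hypothetical illustration; that the
wrapper's own conversions behave exactly like <b>strtol</b> is an assumption.
</P>

```cpp
#include <cstdlib>

// Hypothetical helper: parses `text` in the given base. Base 8 mirrors
// Octal(), base 16 mirrors Hex(), and base 0 mirrors CRadix() by letting
// strtol pick the base from a leading "0" or "0x" prefix.
long parse_radix(const char* text, int base) {
  return std::strtol(text, nullptr, base);
}
```

<P>
Each of parse_radix("100", 8), parse_radix("40", 16), parse_radix("0100", 0),
and parse_radix("0x40", 0) evaluates to 64, while parse_radix("100", 0) falls
back to base 10 and gives 100.
</P>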
<br><a name="SEC10" href="#TOC1">REPLACING PARTS OF STRINGS</a><br>
<P>
You can replace the first match of "pattern" in "str" with "rewrite".
Within "rewrite", backslash-escaped digits (\1 to \9) can be
used to insert the text matching the corresponding parenthesized group
from the pattern. \0 in "rewrite" refers to the entire matching
text. For example:
<pre>
  string s = "yabba dabba doo";
  pcrecpp::RE("b+").Replace("d", &s);
</pre>
will leave "s" containing "yada dabba doo". The result is true if the pattern
matches and a replacement occurs, false otherwise.
</P>
<P>
<b>GlobalReplace</b> is like <b>Replace</b> except that it replaces all
occurrences of the pattern in the string with the rewrite. Replacements are
not subject to re-matching. For example:
<pre>
  string s = "yabba dabba doo";
  pcrecpp::RE("b+").GlobalReplace("d", &s);
</pre>
will leave "s" containing "yada dada doo". It returns the number of
replacements made.
</P>
<P>
<b>Extract</b> is like <b>Replace</b>, except that if the pattern matches,
"rewrite" is copied into "out" (an additional argument) with substitutions.
The non-matching portions of "text" are ignored. It returns true iff a match
occurred and the extraction happened successfully; if no match occurs, the
string is left unaffected.
</P>
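<P>
For readers who want to experiment with these three behaviours without PCRE,
the sketch below reproduces them with <b>std::regex_replace</b>. It is an
analogy with hypothetical helper names, and two differences matter: the
standard library writes group references as $1 rather than \1, and the
helpers return a new string instead of modifying an argument and returning a
status.
</P>

```cpp
#include <regex>
#include <string>

// Replaces only the first match (compare Replace).
std::string replace_first(const std::string& s, const std::string& pattern,
                          const std::string& rewrite) {
  return std::regex_replace(s, std::regex(pattern), rewrite,
                            std::regex_constants::format_first_only);
}

// Replaces every match (compare GlobalReplace).
std::string replace_all(const std::string& s, const std::string& pattern,
                        const std::string& rewrite) {
  return std::regex_replace(s, std::regex(pattern), rewrite);
}

// Keeps only the rewritten first match and drops the unmatched text
// (compare Extract). Returns an empty string when nothing matches.
std::string extract(const std::string& s, const std::string& pattern,
                    const std::string& rewrite) {
  return std::regex_replace(s, std::regex(pattern), rewrite,
                            std::regex_constants::format_first_only |
                                std::regex_constants::format_no_copy);
}
```

<P>
With "yabba dabba doo" and the pattern "b+", replace_first gives
"yada dabba doo" and replace_all gives "yada dada doo", matching the results
stated above. As an illustration of extraction, applying the pattern
"(.*)@([^.]*)" with the rewrite "$2!$1" to "boris@kremvax.ru" gives
"kremvax!boris".
</P>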
<br><a name="SEC11" href="#TOC1">AUTHOR</a><br>
<P>
The C++ wrapper was contributed by Google Inc.
<br>
Copyright © 2007 Google Inc.
<br>
</P>
<br><a name="SEC12" href="#TOC1">REVISION</a><br>
<P>
Last updated: 17 March 2009
<br>
<p>
Return to the <a href="index.html">PCRE index page</a>.
</p>