.TH PCRECPP 3
.SH NAME
PCRE - Perl-compatible regular expressions.
.SH "SYNOPSIS OF C++ WRAPPER"
.rs
.sp
.B #include <pcrecpp.h>
.PP
.SM
.br
.SH DESCRIPTION
.rs
.sp
The C++ wrapper for PCRE was provided by Google Inc. Some additional
functionality was added by Giuseppe Maxia. This brief man page was constructed
from the notes in the \fIpcrecpp.h\fP file, which should be consulted for
further details.
.
.
.SH "MATCHING INTERFACE"
.rs
.sp
The "FullMatch" operation checks that supplied text matches a supplied pattern
exactly. If pointer arguments are supplied, it copies matched sub-strings that
match sub-patterns into them.
.sp
Example: successful match
   pcrecpp::RE re("h.*o");
   re.FullMatch("hello");
.sp
Example: unsuccessful match (requires full match):
   pcrecpp::RE re("e");
   !re.FullMatch("hello");
.sp
Example: creating a temporary RE object:
   pcrecpp::RE("h.*o").FullMatch("hello");
.sp
You can pass in a "const char*" or a "string" for "text". The examples below
tend to use a const char*. You can, as in the different examples above, store
the RE object explicitly in a variable or use a temporary RE object. The
examples below use one mode or the other arbitrarily. Either could correctly be
used for any of these examples.
.P
You must supply extra pointer arguments to extract matched subpieces.
.sp
Example: extracts "ruby" into "s" and 1234 into "i"
   int i;
   string s;
   pcrecpp::RE re("(\e\ew+):(\e\ed+)");
   re.FullMatch("ruby:1234", &s, &i);
.sp
Example: does not try to extract any extra sub-patterns
   re.FullMatch("ruby:1234", &s);
.sp
Example: does not try to extract into NULL
   re.FullMatch("ruby:1234", NULL, &i);
.sp
Example: integer overflow causes failure
   !re.FullMatch("ruby:1234567891234", NULL, &i);
.sp
Example: fails because there aren't enough sub-patterns:
   !pcrecpp::RE("\e\ew+:\e\ed+").FullMatch("ruby:1234", &s);
.sp
Example: fails because string cannot be stored in integer
   !pcrecpp::RE("(.*)").FullMatch("ruby", &i);
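.P
The two failing conversions above (overflow, and text that is not a number)
can be pictured with a checked string-to-int conversion. The following is only
a sketch built on the C library. \fBparse_int\fP is a hypothetical helper, not
a pcrecpp function, and \fIpcrecpp.h\fP remains the reference for what the
wrapper actually does.
.sp
.nf
```cpp
#include <cassert>
#include <cerrno>
#include <climits>
#include <cstdlib>
#include <string>

// Hypothetical helper (not part of pcrecpp): converts a captured
// sub-string to int and reports failure instead of truncating.
bool parse_int(const std::string& text, int* out) {
  assert(out != 0);
  if (text.empty()) return false;
  errno = 0;
  char* end = 0;
  const long value = std::strtol(text.c_str(), &end, 10);
  if (errno == ERANGE) return false;                     // does not fit in long
  if (end != text.c_str() + text.size()) return false;   // not entirely a number
  if (value < INT_MIN || value > INT_MAX) return false;  // does not fit in int
  *out = static_cast<int>(value);                        // written only on success
  return true;
}
```
.fi
.P
Under these assumptions "1234" is stored, while "1234567891234" and "ruby" are
both rejected and leave the destination untouched.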
.sp
The provided pointer arguments can be pointers to any scalar numeric
type, or one of:
.sp
   string        (matched piece is copied to string)
   StringPiece   (StringPiece is mutated to point to matched piece)
   T             (where "bool T::ParseFrom(const char*, int)" exists)
   NULL          (the corresponding matched sub-pattern is not copied)
.sp
The function returns true iff all of the following conditions are satisfied:
.sp
   a. "text" matches "pattern" exactly;
.sp
   b. The number of matched sub-patterns is >= the number of supplied
      pointers;
.sp
   c. The "i"th argument has a suitable type for holding the
      string captured as the "i"th sub-pattern. If you pass in
      NULL for the "i"th argument, or pass fewer arguments than
      the number of sub-patterns, the "i"th captured sub-pattern is
      ignored.
.sp
The matching interface supports at most 16 arguments per call.
If you need more, consider using the more general interface
\fBpcrecpp::RE::DoMatch\fP. See \fBpcrecpp.h\fP for the signature for
\fBDoMatch\fP.
.
.SH "PARTIAL MATCHES"
.rs
.sp
You can use the "PartialMatch" operation when you want the pattern
to match any substring of the text.
.sp
Example: simple search for a string:
   pcrecpp::RE("ell").PartialMatch("hello");
.sp
Example: find first number in a string:
   int number;
   pcrecpp::RE re("(\e\ed+)");
   re.PartialMatch("x*100 + 20", &number);
   assert(number == 100);
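.P
The difference between a full and a partial match is the same as the one
between std::regex_match and std::regex_search in the C++11 standard library.
The sketch below uses those two functions as stand-ins. It does not call
pcrecpp, the function names are made up for illustration, and the std::regex
grammar differs from PCRE in places.
.sp
.nf
```cpp
#include <cassert>
#include <regex>
#include <string>

// Stand-in for FullMatch: the pattern must cover the whole text.
bool full_match(const std::string& pattern, const std::string& text) {
  return std::regex_match(text, std::regex(pattern));
}

// Stand-in for PartialMatch: the pattern may match any substring.
bool partial_match(const std::string& pattern, const std::string& text) {
  return std::regex_search(text, std::regex(pattern));
}

// Stand-in for PartialMatch with one int argument: stores the first run
// of digits found in text. std::stoi throws if the digits overflow an int.
bool first_number(const std::string& text, int* number) {
  assert(number != 0);
  std::smatch m;
  if (!std::regex_search(text, m, std::regex("([0-9]+)"))) return false;
  *number = std::stoi(m[1].str());
  return true;
}
```
.fi
.P
With these helpers, the pattern "e" fails full_match against "hello" but
succeeds with partial_match, and first_number finds 100 in "x*100 + 20".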
.
.
.SH "UTF-8 AND THE MATCHING INTERFACE"
.rs
.sp
By default, pattern and text are plain text, one byte per character. The UTF8
flag, passed to the constructor, causes both pattern and string to be treated
as UTF-8 text, still a byte stream but potentially multiple bytes per
character. In practice, the text is likelier to be UTF-8 than the pattern, but
the match returned may depend on the UTF8 flag, so always use it when matching
UTF-8 text. For example, "." will match one byte normally but with UTF8 set may
match up to three bytes of a multi-byte character.
.sp
Example:
   pcrecpp::RE_Options options;
   options.set_utf8();
   pcrecpp::RE re(utf8_pattern, options);
   re.FullMatch(utf8_string);
.sp
Example: using the convenience function UTF8():
   pcrecpp::RE re(utf8_pattern, pcrecpp::UTF8());
   re.FullMatch(utf8_string);
.sp
NOTE: The UTF8 flag is ignored if PCRE was not configured with the
--enable-utf8 flag.
.
.
.SH "PASSING MODIFIERS TO THE REGULAR EXPRESSION ENGINE"
.rs
.sp
PCRE defines some modifiers to change the behavior of the regular expression
engine. The C++ wrapper defines an auxiliary class, RE_Options, as a vehicle to
pass such modifiers to a RE class. Currently, the following modifiers are
supported:
.sp
   modifier              description               Perl equivalent
.sp
   PCRE_CASELESS         case insensitive match    /i
   PCRE_MULTILINE        multiple lines match      /m
   PCRE_DOTALL           dot matches newlines      /s
   PCRE_DOLLAR_ENDONLY   $ matches only at end     N/A
   PCRE_EXTRA            strict escape parsing     N/A
   PCRE_EXTENDED         ignore whitespace         /x
   PCRE_UTF8             handles UTF-8 chars       built-in
   PCRE_UNGREEDY         reverses * and *?         N/A
   PCRE_NO_AUTO_CAPTURE  disables capturing parens N/A (*)
.sp
   (*) Both Perl and PCRE allow non-capturing parentheses by means of the
   "?:" modifier within the pattern itself. For example, (?:ab|cd) does not
   capture, while (ab|cd) does.
.P
For a full account of how each modifier works, please check the
PCRE API reference page.
.P
For each modifier, there are two member functions whose names are derived
from the modifier in lowercase, without the "PCRE_" prefix. For
instance, PCRE_CASELESS is handled by
.sp
   bool caseless()
.sp
which returns true if the modifier is set, and
.sp
   RE_Options & set_caseless(bool)
.sp
which sets or unsets the modifier. Moreover, PCRE_EXTRA_MATCH_LIMIT can be
accessed through the \fBset_match_limit()\fR and \fBmatch_limit()\fR member
functions. Setting \fImatch_limit\fR to a non-zero value will limit the
execution of PCRE to keep it from doing bad things like blowing the stack or
taking an eternity to return a result. A value of 5000 is good enough to stop
stack blowup in a 2MB thread stack. Setting \fImatch_limit\fR to zero disables
match limiting. Alternatively, you can call \fBmatch_limit_recursion()\fP
which uses PCRE_EXTRA_MATCH_LIMIT_RECURSION to limit how much PCRE
recurses. \fBmatch_limit()\fP limits the number of matches PCRE does;
\fBmatch_limit_recursion()\fP limits the depth of internal recursion, and
therefore the amount of stack that is used.
.P
Normally, to pass one or more modifiers to a RE class, you declare
a \fIRE_Options\fR object, set the appropriate options, and pass this
object to a RE constructor. Example:
.sp
   RE_Options opt;
   opt.set_caseless(true);
   if (RE("HELLO", opt).PartialMatch("hello world")) ...
.sp
RE_Options has two constructors. The default constructor takes no arguments and
creates a set of flags that are off by default. The other takes an
\fIoption_flags\fR parameter, which is there to ease the transfer of legacy
code from C programs. This lets you do
.sp
   RE(pattern,
      RE_Options(PCRE_CASELESS|PCRE_MULTILINE)).PartialMatch(str);
.sp
However, new code is better off doing
.sp
   RE(pattern,
      RE_Options().set_caseless(true).set_multiline(true))
        .PartialMatch(str);
.sp
If you are going to pass one of the most used modifiers, there are some
convenience functions that return a RE_Options class with the
appropriate modifier already set: \fBCASELESS()\fR, \fBUTF8()\fR,
\fBMULTILINE()\fR, \fBDOTALL()\fR, and \fBEXTENDED()\fR.
.P
If you need to set several options at once, and you don't want to go through
the pains of declaring a RE_Options object and setting several options, there
is a parallel method that gives you such ability on the fly. You can concatenate
several \fBset_xxxxx()\fR member functions, since each of them returns a
reference to its class object. For example, to pass PCRE_CASELESS,
PCRE_EXTENDED, and PCRE_MULTILINE to a RE with one statement, you may write:
.sp
   RE(" ^ xyz \e\es+ .* blah$",
     RE_Options()
       .set_caseless(true)
       .set_extended(true)
       .set_multiline(true)).PartialMatch(sometext);
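.P
Chaining of this kind works because each setter returns a reference to the
object it was called on. The following minimal sketch shows the pattern with a
hypothetical \fBFlags\fP class. It is not RE_Options itself, whose real
definition is in \fIpcrecpp.h\fP.
.sp
.nf
```cpp
#include <cassert>

// Hypothetical class for illustration only; this is not RE_Options.
class Flags {
 public:
  Flags() : caseless_(false), multiline_(false) {}

  // Getters report whether a flag is currently set.
  bool caseless() const { return caseless_; }
  bool multiline() const { return multiline_; }

  // Each setter returns *this, so calls can be chained in one expression.
  Flags& set_caseless(bool on) { caseless_ = on; return *this; }
  Flags& set_multiline(bool on) { multiline_ = on; return *this; }

 private:
  bool caseless_;
  bool multiline_;
};

// Builds a temporary and configures it without naming a variable.
inline Flags make_flags() {
  return Flags().set_caseless(true).set_multiline(true);
}
```
.fi
.P
A default-constructed Flags has both flags off, and make_flags() returns an
object with both flags on.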
.sp
.
.
.SH "SCANNING TEXT INCREMENTALLY"
.rs
.sp
The "Consume" operation may be useful if you want to repeatedly
match regular expressions at the front of a string and skip over
them as they match. This requires use of the "StringPiece" type,
which represents a sub-range of a real string. Like RE, StringPiece
is defined in the pcrecpp namespace.
.sp
Example: read lines of the form "var = value" from a string.
   string contents = ...;                 // Fill string somehow
   pcrecpp::StringPiece input(contents);  // Wrap in a StringPiece

   string var;
   int value;
   pcrecpp::RE re("(\e\ew+) = (\e\ed+)\en");
   while (re.Consume(&input, &var, &value)) {
     ...;
   }
.sp
Each successful call to "Consume" will set "var/value", and also
advance "input" so it points past the matched text.
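.P
The consume-and-advance loop can be imitated with the C++11 standard library
by anchoring each search at the current position and then moving that position
past the match. This is a sketch with std::regex, not pcrecpp code.
\fBconsume_pairs\fP is a hypothetical name, and the pattern uses POSIX
character classes, accepting any single whitespace character as the record
terminator.
.sp
.nf
```cpp
#include <cassert>
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Stand-in for a Consume loop, using std::regex rather than pcrecpp.
// Reads "var = value" records from the front of contents and stops at
// the first position where the pattern no longer matches.
std::vector<std::pair<std::string, int> > consume_pairs(
    const std::string& contents) {
  static const std::regex re("([[:alnum:]_]+) = ([[:digit:]]+)[[:space:]]");
  std::vector<std::pair<std::string, int> > result;
  std::string::const_iterator pos = contents.begin();
  std::smatch m;
  // match_continuous anchors every attempt at pos, in the way that
  // Consume anchors at the start of the remaining input.
  while (std::regex_search(pos, contents.end(), m, re,
                           std::regex_constants::match_continuous)) {
    // std::stoi throws if the digits overflow an int.
    result.push_back(std::make_pair(m[1].str(), std::stoi(m[2].str())));
    pos = m[0].second;  // advance past the matched text
  }
  return result;
}
```
.fi
.P
Dropping the match_continuous flag would let each search skip ahead to the
next match, which is closer to the unanchored "FindAndConsume" behavior.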
.P
The "FindAndConsume" operation is similar to "Consume" but does not
anchor your match at the beginning of the string. For example, you
could extract all words from a string by repeatedly calling
.sp
   pcrecpp::RE("(\e\ew+)").FindAndConsume(&input, &word)
.
.
.SH "PARSING HEX/OCTAL/C-RADIX NUMBERS"
.rs
.sp
By default, if you pass a pointer to a numeric value, the
corresponding text is interpreted as a base-10 number. You can
instead wrap the pointer with a call to one of the operators Hex(),
Octal(), or CRadix() to interpret the text in another base. The
CRadix operator interprets C-style "0" (base-8) and "0x" (base-16)
prefixes, but defaults to base-10.
.sp
Example:
   int a, b, c, d;
   pcrecpp::RE re("(.*) (.*) (.*) (.*)");
   re.FullMatch("100 40 0100 0x40",
                pcrecpp::Octal(&a), pcrecpp::Hex(&b),
                pcrecpp::CRadix(&c), pcrecpp::CRadix(&d));
.sp
will leave 64 in a, b, c, and d.
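.P
The arithmetic behind that result can be checked with the C library function
strtol, whose base 0 follows the same C-style prefix rules described for
CRadix. The helper below is a hypothetical illustration and makes no claim
about how pcrecpp performs the conversion internally.
.sp
.nf
```cpp
#include <cassert>
#include <cstdlib>

// Parses text in the given base with the C library. Base 0 asks strtol to
// honour C-style prefixes: "0x" selects base 16, a leading "0" selects
// base 8, and anything else is read as base 10.
long parse_radix(const char* text, int base) {
  assert(text != 0);
  return std::strtol(text, 0, base);
}
```
.fi
.P
Here parse_radix("100", 8), parse_radix("40", 16), parse_radix("0100", 0) and
parse_radix("0x40", 0) all yield 64.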
.
.
.SH "REPLACING PARTS OF STRINGS"
.rs
.sp
You can replace the first match of "pattern" in "str" with "rewrite".
Within "rewrite", backslash-escaped digits (\e1 to \e9) can be
used to insert text matching the corresponding parenthesized group
from the pattern. \e0 in "rewrite" refers to the entire matching
text. For example:
.sp
   string s = "yabba dabba doo";
   pcrecpp::RE("b+").Replace("d", &s);
.sp
will leave "s" containing "yada dabba doo". The result is true if the pattern
matches and a replacement occurs, false otherwise.
.P
\fBGlobalReplace\fP is like \fBReplace\fP except that it replaces all
occurrences of the pattern in the string with the rewrite. Replacements are
not subject to re-matching. For example:
.sp
   string s = "yabba dabba doo";
   pcrecpp::RE("b+").GlobalReplace("d", &s);
.sp
will leave "s" containing "yada dada doo". It returns the number of
replacements made.
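.P
The first-match versus all-matches distinction can be reproduced with
std::regex_replace from the C++11 standard library, which replaces every match
unless the format_first_only flag is given. The sketch below is a stand-in,
not pcrecpp code, and the two function names are hypothetical.
.sp
.nf
```cpp
#include <cassert>
#include <regex>
#include <string>

// Stand-in for Replace: rewrites only the first match of pattern.
std::string replace_first(const std::string& pattern,
                          const std::string& rewrite,
                          const std::string& text) {
  return std::regex_replace(text, std::regex(pattern), rewrite,
                            std::regex_constants::format_first_only);
}

// Stand-in for GlobalReplace: rewrites every non-overlapping match.
std::string replace_all(const std::string& pattern,
                        const std::string& rewrite,
                        const std::string& text) {
  return std::regex_replace(text, std::regex(pattern), rewrite);
}
```
.fi
.P
Unlike the pcrecpp calls, these helpers return a new string instead of
modifying their argument and reporting a result, and std::regex_replace
refers to groups as $1 to $9 in its format string rather than \e1 to \e9.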
.P
\fBExtract\fP is like \fBReplace\fP, except that if the pattern matches,
"rewrite" is copied into "out" (an additional argument) with substitutions.
The non-matching portions of "text" are ignored. Returns true iff a match
occurred and the extraction happened successfully; if no match occurs, the
string is left unaffected.
.
.
.SH AUTHOR
.rs
.sp
The C++ wrapper was contributed by Google Inc.
.br
Copyright (c) 2005 Google Inc.