| 66 |
pattern matching using the same syntax and semantics as Perl 5, with just a few |
pattern matching using the same syntax and semantics as Perl 5, with just a few |
| 67 |
differences (see below). The current implementation corresponds to Perl 5.005. |
differences (see below). The current implementation corresponds to Perl 5.005. |
| 68 |
|
|
| 69 |
PCRE has its own native API, which is described in this man page. There is also |
PCRE has its own native API, which is described in this document. There is also |
| 70 |
a set of wrapper functions that correspond to the POSIX API. See |
a set of wrapper functions that correspond to the POSIX API. These are |
| 71 |
\fBpcreposix (3)\fR. |
described in the \fBpcreposix\fR documentation. |
| 72 |
|
|
| 73 |
|
The native API function prototypes are defined in the header file \fBpcre.h\fR, |
| 74 |
|
and on Unix systems the library itself is called \fBlibpcre.a\fR, so can be |
| 75 |
|
accessed by adding \fB-lpcre\fR to the command for linking an application which |
| 76 |
|
calls it. |
| 77 |
|
|
| 78 |
The functions \fBpcre_compile()\fR, \fBpcre_study()\fR, and \fBpcre_exec()\fR |
The functions \fBpcre_compile()\fR, \fBpcre_study()\fR, and \fBpcre_exec()\fR |
| 79 |
are used for compiling and matching regular expressions, while |
are used for compiling and matching regular expressions, while |
| 242 |
|
|
| 243 |
An alternative set of tables can, however, be supplied. Such tables are built |
An alternative set of tables can, however, be supplied. Such tables are built |
| 244 |
by calling the \fBpcre_maketables()\fR function, which has no arguments, in the |
by calling the \fBpcre_maketables()\fR function, which has no arguments, in the |
| 245 |
relevant locale. The result can then be passed to \fBpcre_compile()\ as often |
relevant locale. The result can then be passed to \fBpcre_compile()\fR as often |
| 246 |
as necessary. For example, to build and use tables that are appropriate for the |
as necessary. For example, to build and use tables that are appropriate for the |
| 247 |
French locale (where accented characters with codes greater than 128 are |
French locale (where accented characters with codes greater than 128 are |
| 248 |
treated as letters), the following code could be used: |
treated as letters), the following code could be used: |
| 281 |
(cat|cow|coyote), then it is returned in the integer pointed to by |
(cat|cow|coyote), then it is returned in the integer pointed to by |
| 282 |
\fIfirstcharptr\fR. Otherwise, if either |
\fIfirstcharptr\fR. Otherwise, if either |
| 283 |
|
|
| 284 |
(a) the pattern was compiled with the PCRE_MULTILINE option, and every branch |
(a) the pattern was compiled with the PCRE_MULTILINE option, and every branch |
| 285 |
starts with "^", or |
starts with "^", or |
| 286 |
|
|
| 287 |
(b) every branch of the pattern starts with ".*" and PCRE_DOTALL is not set |
(b) every branch of the pattern starts with ".*" and PCRE_DOTALL is not set |
| 288 |
(if it were set, the pattern would be anchored), |
(if it were set, the pattern would be anchored), |
| 289 |
|
|
| 290 |
then -1 is returned, indicating that the pattern matches only at the |
then -1 is returned, indicating that the pattern matches only at the |
| 291 |
start of a subject string or after any "\\n" within the string. Otherwise -2 is |
start of a subject string or after any "\\n" within the string. Otherwise -2 is |
| 303 |
PCRE_ANCHORED, or turned out to be anchored by virtue of its contents, it |
PCRE_ANCHORED, or turned out to be anchored by virtue of its contents, it |
| 304 |
cannot be made unanchored at matching time. |
cannot be made unanchored at matching time. |
| 305 |
|
|
| 306 |
There are also two further options that can be set only at matching time: |
There are also three further options that can be set only at matching time: |
| 307 |
|
|
| 308 |
PCRE_NOTBOL |
PCRE_NOTBOL |
| 309 |
|
|
| 318 |
it. Setting this without PCRE_MULTILINE (at compile time) causes dollar never |
it. Setting this without PCRE_MULTILINE (at compile time) causes dollar never |
| 319 |
to match. |
to match. |
| 320 |
|
|
| 321 |
|
PCRE_NOTEMPTY |
| 322 |
|
|
| 323 |
|
An empty string is not considered to be a valid match if this option is set. If |
| 324 |
|
there are alternatives in the pattern, they are tried. If all the alternatives |
| 325 |
|
match the empty string, the entire match fails. For example, if the pattern |
| 326 |
|
|
| 327 |
|
a?b? |
| 328 |
|
|
| 329 |
|
is applied to a string not beginning with "a" or "b", it matches the empty |
| 330 |
|
string at the start of the subject. With PCRE_NOTEMPTY set, this match is not |
| 331 |
|
valid, so PCRE searches further into the string for occurrences of "a" or "b". |
| 332 |
|
Perl has no direct equivalent of this option, but it makes a special case of |
| 333 |
|
a pattern match of the empty string within its \fBsplit()\fR function. Using |
| 334 |
|
PCRE_NOTEMPTY it is possible to emulate this behaviour. |
| 335 |
|
|
| 336 |
The subject string is passed as a pointer in \fIsubject\fR, a length in |
The subject string is passed as a pointer in \fIsubject\fR, a length in |
| 337 |
\fIlength\fR, and a starting offset in \fIstartoffset\fR. Unlike the pattern |
\fIlength\fR, and a starting offset in \fIstartoffset\fR. Unlike the pattern |
| 338 |
string, it may contain binary zero characters. When the starting offset is |
string, it may contain binary zero characters. When the starting offset is |
| 592 |
inverted, that is, by default they are not greedy, but if followed by a |
inverted, that is, by default they are not greedy, but if followed by a |
| 593 |
question mark they are. |
question mark they are. |
| 594 |
|
|
| 595 |
|
(e) PCRE_ANCHORED can be used to force a pattern to be tried only at the start |
| 596 |
|
of the subject. |
| 597 |
|
|
| 598 |
|
(f) The PCRE_NOTBOL, PCRE_NOTEOL, and PCRE_NOTEMPTY options for |
| 599 |
|
\fBpcre_exec()\fR have no Perl equivalents. |
| 600 |
|
|
| 601 |
|
|
| 602 |
.SH REGULAR EXPRESSION DETAILS |
.SH REGULAR EXPRESSION DETAILS |
| 603 |
The syntax and semantics of the regular expressions supported by PCRE are |
The syntax and semantics of the regular expressions supported by PCRE are |
| 1266 |
|
|
| 1267 |
(?<=\\d{3})(?<!999)foo |
(?<=\\d{3})(?<!999)foo |
| 1268 |
|
|
| 1269 |
matches "foo" preceded by three digits that are not "999". Furthermore, |
matches "foo" preceded by three digits that are not "999". Notice that each of |
| 1270 |
assertions can be nested in any combination. For example, |
the assertions is applied independently at the same point in the subject |
| 1271 |
|
string. First there is a check that the previous three characters are all |
| 1272 |
|
digits, then there is a check that the same three characters are not "999". |
| 1273 |
|
This pattern does \fInot\fR match "foo" preceded by six characters, the first |
| 1274 |
|
three of which are digits and the last three of which are not "999". For example, it |
| 1275 |
|
doesn't match "123abcfoo". A pattern to do that is |
| 1276 |
|
|
| 1277 |
|
(?<=\\d{3}...)(?<!999)foo |
| 1278 |
|
|
| 1279 |
|
This time the first assertion looks at the preceding six characters, checking |
| 1280 |
|
that the first three are digits, and then the second assertion checks that the |
| 1281 |
|
preceding three characters are not "999". |
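The difference between the two patterns can be made concrete with a hand-written check. The following is only a sketch in plain C, not PCRE code and not how PCRE implements lookbehind: the function names \fBmatch_same_point()\fR and \fBmatch_six_back()\fR and the helper \fBfoo_at()\fR are hypothetical, and each one tests a single candidate position \fIpos\fR at which "foo" is supposed to start.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: does the literal "foo" start at s[pos]? */
static int foo_at(const char *s, size_t len, size_t pos)
{
    return pos + 3 <= len && memcmp(s + pos, "foo", 3) == 0;
}

/* Emulates (?<=\d{3})(?<!999)foo at one position. Both lookbehinds
   inspect the SAME three characters immediately before "foo". */
int match_same_point(const char *s, size_t pos)
{
    size_t len = strlen(s);
    const unsigned char *p;
    int digits, is999;

    if (pos < 3 || !foo_at(s, len, pos))
        return 0;
    p = (const unsigned char *)s + pos - 3;
    digits = isdigit(p[0]) && isdigit(p[1]) && isdigit(p[2]);
    is999 = memcmp(p, "999", 3) == 0;
    return digits && !is999;
}

/* Emulates (?<=\d{3}...)(?<!999)foo at one position. The first
   lookbehind reaches back six characters (three digits, then any three
   non-newline characters); the second looks only at the last three. */
int match_six_back(const char *s, size_t pos)
{
    size_t len = strlen(s);
    const unsigned char *p;
    int digits, dots, is999;

    if (pos < 6 || !foo_at(s, len, pos))
        return 0;
    p = (const unsigned char *)s + pos - 6;
    digits = isdigit(p[0]) && isdigit(p[1]) && isdigit(p[2]);
    dots = p[3] != '\n' && p[4] != '\n' && p[5] != '\n';
    is999 = memcmp(p + 3, "999", 3) == 0;
    return digits && dots && !is999;
}
```

With the subject "123abcfoo" and \fIpos\fR set to 6, the first function rejects the position because "abc" is not three digits, whereas the second accepts it.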
| 1282 |
|
|
| 1283 |
|
Assertions can be nested in any combination. For example, |
| 1284 |
|
|
| 1285 |
(?<=(?<!foo)bar)baz |
(?<=(?<!foo)bar)baz |
| 1286 |
|
|
| 1287 |
matches an occurrence of "baz" that is preceded by "bar" which in turn is not |
matches an occurrence of "baz" that is preceded by "bar" which in turn is not |
| 1288 |
preceded by "foo". |
preceded by "foo", while |
| 1289 |
|
|
| 1290 |
|
(?<=\\d{3}(?!999)...)foo |
| 1291 |
|
|
| 1292 |
|
is another pattern which matches "foo" preceded by three digits and any three |
| 1293 |
|
characters that are not "999". |
| 1294 |
|
|
| 1295 |
Assertion subpatterns are not capturing subpatterns, and may not be repeated, |
Assertion subpatterns are not capturing subpatterns, and may not be repeated, |
| 1296 |
because it makes no sense to assert the same thing several times. If any kind |
because it makes no sense to assert the same thing several times. If any kind |
| 1442 |
string contains newlines, the pattern may match from the character immediately |
string contains newlines, the pattern may match from the character immediately |
| 1443 |
following one of them instead of from the very start. For example, the pattern |
following one of them instead of from the very start. For example, the pattern |
| 1444 |
|
|
| 1445 |
(.*) second |
(.*) second |
| 1446 |
|
|
| 1447 |
matches the subject "first\\nand second" (where \\n stands for a newline |
matches the subject "first\\nand second" (where \\n stands for a newline |
| 1448 |
character) with the first captured substring being "and". In order to do this, |
character) with the first captured substring being "and". In order to do this, |
| 1453 |
the pattern with ^.* to indicate explicit anchoring. That saves PCRE from |
the pattern with ^.* to indicate explicit anchoring. That saves PCRE from |
| 1454 |
having to scan along the subject looking for a newline to restart at. |
having to scan along the subject looking for a newline to restart at. |
| 1455 |
|
|
| 1456 |
|
Beware of patterns that contain nested indefinite repeats. These can take a |
| 1457 |
|
long time to run when applied to a string that does not match. Consider the |
| 1458 |
|
pattern fragment |
| 1459 |
|
|
| 1460 |
|
(a+)* |
| 1461 |
|
|
| 1462 |
|
This can match "aaaa" in 33 different ways, and this number increases very |
| 1463 |
|
rapidly as the string gets longer. (The * repeat can match 0, 1, 2, 3, or 4 |
| 1464 |
|
times, and for each of those cases other than 0, the + repeats can match |
| 1465 |
|
different numbers of times.) When the remainder of the pattern is such that the |
| 1466 |
|
entire match is going to fail, PCRE has in principle to try every possible |
| 1467 |
|
variation, and this can take an extremely long time. |
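The growth can be sketched with a toy model. This is an assumption-laden illustration in plain C, not PCRE's matcher: the hypothetical function \fBcount_attempts()\fR counts how often a naive backtracking engine would reach the remainder of the pattern when (a+)* is applied to \fIn\fR "a" characters and the remainder always fails. The exact figure depends on what is counted, so it is not meant to reproduce the number quoted above. The point is only that, in this model, the count doubles with each additional character.

```c
#include <stddef.h>

/* Toy model, not PCRE itself: number of times the failing remainder of
   the pattern is tried when (a+)* faces n remaining 'a' characters. */
unsigned long count_attempts(size_t n)
{
    unsigned long total = 1;  /* stop repeating here; the remainder fails */
    size_t k;

    /* Otherwise let one more a+ consume k characters and recurse. */
    for (k = 1; k <= n; k++)
        total += count_attempts(n - k);
    return total;
}
```

In this model a subject of 10 characters gives 1024 attempts and one of 20 characters gives 1048576, which is why a failing match against even a moderately long line can become very slow.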
| 1468 |
|
|
| 1469 |
|
An optimization catches some of the simpler cases such as |
| 1470 |
|
|
| 1471 |
|
(a+)*b |
| 1472 |
|
|
| 1473 |
|
where a literal character follows. Before embarking on the standard matching |
| 1474 |
|
procedure, PCRE checks that there is a "b" later in the subject string, and if |
| 1475 |
|
there is not, it fails the match immediately. However, when there is no |
| 1476 |
|
following literal this optimization cannot be used. You can see the difference |
| 1477 |
|
by comparing the behaviour of |
| 1478 |
|
|
| 1479 |
|
(a+)*\\d |
| 1480 |
|
|
| 1481 |
|
with the pattern above. The pattern above gives a failure almost instantly when |
| 1482 |

applied to a whole line of "a" characters, whereas this one takes an |
| 1483 |
|
appreciable time with strings longer than about 20 characters. |
| 1484 |
|
|
| 1485 |
.SH AUTHOR |
.SH AUTHOR |
| 1486 |
Philip Hazel <ph10@cam.ac.uk> |
Philip Hazel <ph10@cam.ac.uk> |
| 1487 |
.br |
.br |
| 1493 |
.br |
.br |
| 1494 |
Phone: +44 1223 334714 |
Phone: +44 1223 334714 |
| 1495 |
|
|
| 1496 |
Last updated: 10 June 1999 |
Last updated: 29 July 1999 |
| 1497 |
.br |
.br |
| 1498 |
Copyright (c) 1997-1999 University of Cambridge. |
Copyright (c) 1997-1999 University of Cambridge. |