| 52 |
instead of recognizing only characters with codes less than 128 via a lookup |
instead of recognizing only characters with codes less than 128 via a lookup |
| 53 |
table. |
table. |
| 54 |
.P |
.P |
| 55 |
|
If a pattern starts with (*NO_START_OPT), it has the same effect as setting the |
| 56 |
|
PCRE_NO_START_OPTIMIZE option either at compile or matching time. There are |
| 57 |
|
also some more of these special sequences that are concerned with the handling |
| 58 |
|
of newlines; they are described below. |
| 59 |
|
.P |
| 60 |
The remainder of this document discusses the patterns that are supported by |
The remainder of this document discusses the patterns that are supported by |
| 61 |
PCRE when its main matching function, \fBpcre_exec()\fP, is used. |
PCRE when its main matching function, \fBpcre_exec()\fP, is used. |
| 62 |
From release 6.0, PCRE offers a second matching function, |
From release 6.0, PCRE offers a second matching function, |
| 189 |
The backslash character has several uses. Firstly, if it is followed by a |
The backslash character has several uses. Firstly, if it is followed by a |
| 190 |
character that is not a number or a letter, it takes away any special meaning |
character that is not a number or a letter, it takes away any special meaning |
| 191 |
that character may have. This use of backslash as an escape character applies |
that character may have. This use of backslash as an escape character applies |
| 192 |
both inside and outside character classes. |
both inside and outside character classes. |
| 193 |
.P |
.P |
| 194 |
For example, if you want to match a * character, you write \e* in the pattern. |
For example, if you want to match a * character, you write \e* in the pattern. |
| 195 |
This escaping action applies whether or not the following character would |
This escaping action applies whether or not the following character would |
| 198 |
particular, if you want to match a backslash, you write \e\e. |
particular, if you want to match a backslash, you write \e\e. |
| 199 |
.P |
.P |
| 200 |
In UTF-8 mode, only ASCII numbers and letters have any special meaning after a |
In UTF-8 mode, only ASCII numbers and letters have any special meaning after a |
| 201 |
backslash. All other characters (in particular, those whose codepoints are |
backslash. All other characters (in particular, those whose codepoints are |
| 202 |
greater than 127) are treated as literals. |
greater than 127) are treated as literals. |
| 203 |
.P |
.P |
| 204 |
If a pattern is compiled with the PCRE_EXTENDED option, whitespace in the |
If a pattern is compiled with the PCRE_EXTENDED option, whitespace in the |
| 220 |
\eQabc\eE\e$\eQxyz\eE abc$xyz abc$xyz |
\eQabc\eE\e$\eQxyz\eE abc$xyz abc$xyz |
| 221 |
.sp |
.sp |
| 222 |
The \eQ...\eE sequence is recognized both inside and outside character classes. |
The \eQ...\eE sequence is recognized both inside and outside character classes. |
| 223 |
An isolated \eE that is not preceded by \eQ is ignored. |
An isolated \eE that is not preceded by \eQ is ignored. If \eQ is not followed |
| 224 |
|
by \eE later in the pattern, the literal interpretation continues to the end of |
| 225 |
|
the pattern (that is, \eE is assumed at the end). If the isolated \eQ is inside |
| 226 |
|
a character class, this causes an error, because the character class is not |
| 227 |
|
terminated. |
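
The \eQ...\eE rules above can be pictured as a preprocessing step. The following is a minimal sketch in Python, not PCRE's implementation: expand_quoting is a hypothetical helper that rewrites quoted spans as escaped literals so that Python's re module can run the result. It assumes the pattern contains no character classes, so the unterminated-class error case is not modelled.

```python
import re


def expand_quoting(pattern: str) -> str:
    """Rewrite \\Q...\\E spans as escaped literals (hypothetical helper)."""
    out = []
    quoting = False
    i = 0
    while i < len(pattern):
        two = pattern[i:i + 2]
        if quoting:
            if two == r"\E":
                quoting = False          # end of the literal span
                i += 2
            else:
                out.append(re.escape(pattern[i]))
                i += 1
        elif two == r"\Q":
            quoting = True               # start of a literal span
            i += 2
        elif two == r"\E":
            i += 2                       # isolated \E is ignored
        elif pattern[i] == "\\" and i + 1 < len(pattern):
            out.append(two)              # keep other escapes intact
            i += 2
        else:
            out.append(pattern[i])
            i += 1
    # A \Q with no later \E simply runs to the end of the pattern.
    return "".join(out)


print(expand_quoting(r"\Qabc\E\$\Qxyz\E"))
print(expand_quoting(r"\Qa.b"))
```

Because an unterminated \eQ leaves the helper in quoting mode until the input ends, the second call treats the dot as a literal character.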
| 228 |
. |
. |
| 229 |
. |
. |
| 230 |
.\" HTML <a name="digitsafterbackslash"></a> |
.\" HTML <a name="digitsafterbackslash"></a> |
| 251 |
The precise effect of \ecx is as follows: if x is a lower case letter, it |
The precise effect of \ecx is as follows: if x is a lower case letter, it |
| 252 |
is converted to upper case. Then bit 6 of the character (hex 40) is inverted. |
is converted to upper case. Then bit 6 of the character (hex 40) is inverted. |
| 253 |
Thus \ecz becomes hex 1A (z is 7A), but \ec{ becomes hex 3B ({ is 7B), while |
Thus \ecz becomes hex 1A (z is 7A), but \ec{ becomes hex 3B ({ is 7B), while |
| 254 |
\ec; becomes hex 7B (; is 3B). If the byte following \ec has a value greater |
\ec; becomes hex 7B (; is 3B). If the byte following \ec has a value greater |
| 255 |
than 127, a compile-time error occurs. This locks out non-ASCII characters in |
than 127, a compile-time error occurs. This locks out non-ASCII characters in |
| 256 |
both byte mode and UTF-8 mode. (When PCRE is compiled in EBCDIC mode, all byte |
both byte mode and UTF-8 mode. (When PCRE is compiled in EBCDIC mode, all byte |
| 257 |
values are valid. A lower case letter is converted to upper case, and then the |
values are valid. A lower case letter is converted to upper case, and then the |
| 258 |
0xc0 bits are flipped.) |
0xc0 bits are flipped.) |
| 259 |
.P |
.P |
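
The \ecx arithmetic described above can be checked with a few lines of Python. This is an illustrative sketch for the ASCII (non-EBCDIC) case only; control_escape is a hypothetical name, not a PCRE function.

```python
def control_escape(x: str) -> int:
    # Return the character code that \cx produces for an ASCII character x.
    code = ord(x)
    if code > 127:
        # The text above says this is a compile-time error.
        raise ValueError("character after \\c must be ASCII")
    if "a" <= x <= "z":
        code -= 0x20        # convert lower case to upper case
    return code ^ 0x40      # invert bit 6 (hex 40)


for ch in "z{;":
    print(f"\\c{ch} -> hex {control_escape(ch):02X}")
```

The three characters are the ones used in the paragraph above: z, {, and ;.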
| 260 |
After \ex, from zero to two hexadecimal digits are read (letters can be in |
After \ex, from zero to two hexadecimal digits are read (letters can be in |
| 441 |
\eB because they are defined in terms of \ew and \eW. Matching these sequences |
\eB because they are defined in terms of \ew and \eW. Matching these sequences |
| 442 |
is noticeably slower when PCRE_UCP is set. |
is noticeably slower when PCRE_UCP is set. |
| 443 |
.P |
.P |
| 444 |
The sequences \eh, \eH, \ev, and \eV are features that were added to Perl at |
The sequences \eh, \eH, \ev, and \eV are features that were added to Perl at |
| 445 |
release 5.10. In contrast to the other sequences, which match only ASCII |
release 5.10. In contrast to the other sequences, which match only ASCII |
| 446 |
characters by default, these always match certain high-valued codepoints in |
characters by default, these always match certain high-valued codepoints in |
| 447 |
UTF-8 mode, whether or not PCRE_UCP is set. The horizontal space characters |
UTF-8 mode, whether or not PCRE_UCP is set. The horizontal space characters |
| 757 |
preceding character. None of them have codepoints less than 256, so in |
preceding character. None of them have codepoints less than 256, so in |
| 758 |
non-UTF-8 mode \eX matches any one character. |
non-UTF-8 mode \eX matches any one character. |
| 759 |
.P |
.P |
| 760 |
|
Note that recent versions of Perl have changed \eX to match what Unicode calls |
| 761 |
|
an "extended grapheme cluster", which has a more complicated definition. |
| 762 |
|
.P |
| 763 |
Matching characters by Unicode property is not fast, because PCRE has to search |
Matching characters by Unicode property is not fast, because PCRE has to search |
| 764 |
a structure that contains data for over fifteen thousand characters. That is |
a structure that contains data for over fifteen thousand characters. That is |
| 765 |
why the traditional escape sequences such as \ed and \ew do not use Unicode |
why the traditional escape sequences such as \ed and \ew do not use Unicode |
| 967 |
dollar, the only relationship being that they both involve newlines. Dot has no |
dollar, the only relationship being that they both involve newlines. Dot has no |
| 968 |
special meaning in a character class. |
special meaning in a character class. |
| 969 |
.P |
.P |
| 970 |
The escape sequence \eN behaves like a dot, except that it is not affected by |
The escape sequence \eN behaves like a dot, except that it is not affected by |
| 971 |
the PCRE_DOTALL option. In other words, it matches any character except one |
the PCRE_DOTALL option. In other words, it matches any character except one |
| 972 |
that signifies the end of a line. |
that signifies the end of a line. |
| 973 |
. |
. |
| 1083 |
A circumflex can conveniently be used with the upper case character types to |
A circumflex can conveniently be used with the upper case character types to |
| 1084 |
specify a more restricted set of characters than the matching lower case type. |
specify a more restricted set of characters than the matching lower case type. |
| 1085 |
For example, the class [^\eW_] matches any letter or digit, but not underscore, |
For example, the class [^\eW_] matches any letter or digit, but not underscore, |
| 1086 |
whereas [\ew] includes underscore. A positive character class should be read as |
whereas [\ew] includes underscore. A positive character class should be read as |
| 1087 |
"something OR something OR ..." and a negative class as "NOT something AND NOT |
"something OR something OR ..." and a negative class as "NOT something AND NOT |
| 1088 |
something AND NOT ...". |
something AND NOT ...". |
| 1089 |
.P |
.P |
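
The difference between [^\eW_] and [\ew] can be seen in a short experiment. The sketch below uses Python's re module rather than PCRE, on the assumption that the two agree for these simple classes; the re.ASCII flag is set to mirror PCRE's default ASCII-only handling of \ew.

```python
import re

# "NOT non-word AND NOT underscore": letters and digits only.
letters_digits = re.compile(r"[^\W_]", re.ASCII)
# Plain \w inside a class: letters, digits, and underscore.
word_chars = re.compile(r"[\w]", re.ASCII)

for ch in ("a", "7", "_", "-"):
    in_restricted = bool(letters_digits.fullmatch(ch))
    in_word = bool(word_chars.fullmatch(ch))
    print(repr(ch), in_restricted, in_word)
```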
| 1090 |
The only metacharacters that are recognized in character classes are backslash, |
The only metacharacters that are recognized in character classes are backslash, |
| 1438 |
an escape such as \ed or \epL that matches a single character |
an escape such as \ed or \epL that matches a single character |
| 1439 |
a character class |
a character class |
| 1440 |
a back reference (see next section) |
a back reference (see next section) |
| 1441 |
a parenthesized subpattern (unless it is an assertion) |
a parenthesized subpattern (including assertions) |
| 1442 |
a recursive or "subroutine" call to a subpattern |
a recursive or "subroutine" call to a subpattern |
| 1443 |
.sp |
.sp |
| 1444 |
The general repetition quantifier specifies a minimum and maximum number of |
The general repetition quantifier specifies a minimum and maximum number of |
| 1829 |
that look behind it. An assertion subpattern is matched in the normal way, |
that look behind it. An assertion subpattern is matched in the normal way, |
| 1830 |
except that it does not cause the current matching position to be changed. |
except that it does not cause the current matching position to be changed. |
| 1831 |
.P |
.P |
| 1832 |
Assertion subpatterns are not capturing subpatterns, and may not be repeated, |
Assertion subpatterns are not capturing subpatterns. If such an assertion |
| 1833 |
because it makes no sense to assert the same thing several times. If any kind |
contains capturing subpatterns within it, these are counted for the purposes of |
| 1834 |
of assertion contains capturing subpatterns within it, these are counted for |
numbering the capturing subpatterns in the whole pattern. However, substring |
| 1835 |
the purposes of numbering the capturing subpatterns in the whole pattern. |
capturing is carried out only for positive assertions, because it does not make |
| 1836 |
However, substring capturing is carried out only for positive assertions, |
sense for negative assertions. |
| 1837 |
because it does not make sense for negative assertions. |
.P |
| 1838 |
|
For compatibility with Perl, assertion subpatterns may be repeated; though |
| 1839 |
|
it makes no sense to assert the same thing several times, the side effect of |
| 1840 |
|
capturing parentheses may occasionally be useful. In practice, there are only three |
| 1841 |
|
cases: |
| 1842 |
|
.sp |
| 1843 |
|
(1) If the quantifier is {0}, the assertion is never obeyed during matching. |
| 1844 |
|
However, it may contain internal capturing parenthesized groups that are called |
| 1845 |
|
from elsewhere via the |
| 1846 |
|
.\" HTML <a href="#subpatternsassubroutines"> |
| 1847 |
|
.\" </a> |
| 1848 |
|
subroutine mechanism. |
| 1849 |
|
.\" |
| 1850 |
|
.sp |
| 1851 |
|
(2) If the quantifier is {0,n} where n is greater than zero, it is treated as if it |
| 1852 |
|
were {0,1}. At run time, the rest of the pattern match is tried with and |
| 1853 |
|
without the assertion, the order depending on the greediness of the quantifier. |
| 1854 |
|
.sp |
| 1855 |
|
(3) If the minimum repetition is greater than zero, the quantifier is ignored. |
| 1856 |
|
The assertion is obeyed just once when encountered during matching. |
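
The three cases above amount to a small normalization of the quantifier. The following Python sketch restates them; it is not PCRE code. effective_assertion_quantifier is a hypothetical name, and treating an open-ended maximum (passed as None) as case (2) is an assumption that the text above does not state.

```python
from typing import Optional, Tuple


def effective_assertion_quantifier(
    minimum: int, maximum: Optional[int]
) -> Tuple[int, int]:
    """Map a quantifier on an assertion to the repetition actually used."""
    if minimum < 0 or (maximum is not None and maximum < minimum):
        raise ValueError("invalid quantifier")
    if minimum > 0:
        return (1, 1)   # case (3): quantifier ignored, obeyed just once
    if maximum == 0:
        return (0, 0)   # case (1): {0}, never obeyed during matching
    return (0, 1)       # case (2): {0,n} with n > 0 behaves like {0,1}
                        # (maximum=None lands here by assumption)


for quantifier in ((0, 0), (0, 5), (2, 7)):
    print(quantifier, "->", effective_assertion_quantifier(*quantifier))
```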
| 1857 |
. |
. |
| 1858 |
. |
. |
| 1859 |
.SS "Lookahead assertions" |
.SS "Lookahead assertions" |
| 2023 |
If the condition is satisfied, the yes-pattern is used; otherwise the |
If the condition is satisfied, the yes-pattern is used; otherwise the |
| 2024 |
no-pattern (if present) is used. If there are more than two alternatives in the |
no-pattern (if present) is used. If there are more than two alternatives in the |
| 2025 |
subpattern, a compile-time error occurs. Each of the two alternatives may |
subpattern, a compile-time error occurs. Each of the two alternatives may |
| 2026 |
itself contain nested subpatterns of any form, including conditional |
itself contain nested subpatterns of any form, including conditional |
| 2027 |
subpatterns; the restriction to two alternatives applies only at the level of |
subpatterns; the restriction to two alternatives applies only at the level of |
| 2028 |
the condition. This pattern fragment is an example where the alternatives are |
the condition. This pattern fragment is an example where the alternatives are |
| 2029 |
complex: |
complex: |
| 2030 |
.sp |
.sp |
| 2031 |
(?(1) (A|B|C) | (D | (?(2)E|F) | E) ) |
(?(1) (A|B|C) | (D | (?(2)E|F) | E) ) |
| 2050 |
to precede the digits with a plus or minus sign. In this case, the subpattern |
to precede the digits with a plus or minus sign. In this case, the subpattern |
| 2051 |
number is relative rather than absolute. The most recently opened parentheses |
number is relative rather than absolute. The most recently opened parentheses |
| 2052 |
can be referenced by (?(-1), the next most recent by (?(-2), and so on. Inside |
can be referenced by (?(-1), the next most recent by (?(-2), and so on. Inside |
| 2053 |
loops it can also make sense to refer to subsequent groups. The next |
loops it can also make sense to refer to subsequent groups. The next |
| 2054 |
parentheses to be opened can be referenced as (?(+1), and so on. (The value |
parentheses to be opened can be referenced as (?(+1), and so on. (The value |
| 2055 |
zero in any of these forms is not used; it provokes a compile-time error.) |
zero in any of these forms is not used; it provokes a compile-time error.) |
| 2056 |
.P |
.P |
| 2170 |
.SH COMMENTS |
.SH COMMENTS |
| 2171 |
.rs |
.rs |
| 2172 |
.sp |
.sp |
| 2173 |
There are two ways of including comments in patterns that are processed by |
There are two ways of including comments in patterns that are processed by |
| 2174 |
PCRE. In both cases, the start of the comment must not be in a character class, |
PCRE. In both cases, the start of the comment must not be in a character class, |
| 2175 |
nor in the middle of any other sequence of related characters such as (?: or a |
nor in the middle of any other sequence of related characters such as (?: or a |
| 2176 |
subpattern name or number. The characters that make up a comment play no part |
subpattern name or number. The characters that make up a comment play no part |
| 2194 |
.sp |
.sp |
| 2195 |
abc #comment \en still comment |
abc #comment \en still comment |
| 2196 |
.sp |
.sp |
| 2197 |
On encountering the # character, \fBpcre_compile()\fP skips along, looking for |
On encountering the # character, \fBpcre_compile()\fP skips along, looking for |
| 2198 |
a newline in the pattern. The sequence \en is still literal at this stage, so |
a newline in the pattern. The sequence \en is still literal at this stage, so |
| 2199 |
it does not terminate the comment. Only an actual character with the code value |
it does not terminate the comment. Only an actual character with the code value |
| 2200 |
0x0a (the default newline) does so. |
0x0a (the default newline) does so. |
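
The distinction between the two-character sequence \en and a real newline can be demonstrated outside PCRE. This sketch uses Python's re module with re.VERBOSE, which plays a role similar to PCRE_EXTENDED; that Python treats # comments the same way here is a property of Python, offered only as an analogy.

```python
import re

# Raw string: the pattern holds a backslash followed by "n", so the
# comment runs to the end of the pattern and only "abc" is matched.
escaped = re.compile(r"abc #comment \n still comment", re.VERBOSE)

# Ordinary string: "\n" is a real 0x0a character, which ends the
# comment, so "def" is part of the pattern again.
real_newline = re.compile("abc #comment \n def", re.VERBOSE)

print(bool(escaped.fullmatch("abc")))
print(bool(real_newline.fullmatch("abcdef")))
```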
| 2511 |
.P |
.P |
| 2512 |
If any of these verbs are used in an assertion or subroutine subpattern |
If any of these verbs are used in an assertion or subroutine subpattern |
| 2513 |
(including recursive subpatterns), their effect is confined to that subpattern; |
(including recursive subpatterns), their effect is confined to that subpattern; |
| 2514 |
it does not extend to the surrounding pattern. Note that such subpatterns are |
it does not extend to the surrounding pattern, with one exception: a *MARK that |
| 2515 |
processed as anchored at the point where they are tested. |
is encountered in a positive assertion \fIis\fP passed back (compare capturing |
| 2516 |
|
parentheses in assertions). Note that such subpatterns are processed as |
| 2517 |
|
anchored at the point where they are tested. |
| 2518 |
.P |
.P |
| 2519 |
The new verbs make use of what was previously invalid syntax: an opening |
The new verbs make use of what was previously invalid syntax: an opening |
| 2520 |
parenthesis followed by an asterisk. They are generally of the form |
parenthesis followed by an asterisk. They are generally of the form |
| 2530 |
present. When one of these optimizations suppresses the running of a match, any |
present. When one of these optimizations suppresses the running of a match, any |
| 2531 |
included backtracking verbs will not, of course, be processed. You can suppress |
included backtracking verbs will not, of course, be processed. You can suppress |
| 2532 |
the start-of-match optimizations by setting the PCRE_NO_START_OPTIMIZE option |
the start-of-match optimizations by setting the PCRE_NO_START_OPTIMIZE option |
| 2533 |
when calling \fBpcre_exec()\fP. |
when calling \fBpcre_compile()\fP or \fBpcre_exec()\fP, or by starting the |
| 2534 |
|
pattern with (*NO_START_OPT). |
| 2535 |
. |
. |
| 2536 |
. |
. |
| 2537 |
.SS "Verbs that act immediately" |
.SS "Verbs that act immediately" |
| 2605 |
of obtaining this information than putting each alternative in its own |
of obtaining this information than putting each alternative in its own |
| 2606 |
capturing parentheses. |
capturing parentheses. |
| 2607 |
.P |
.P |
| 2608 |
|
If (*MARK) is encountered in a positive assertion, its name is recorded and |
| 2609 |
|
passed back if it is the last-encountered. This does not happen for negative |
| 2610 |
|
assertions. |
| 2611 |
|
.P |
| 2612 |
A name may also be returned after a failed match if the final path through the |
A name may also be returned after a failed match if the final path through the |
| 2613 |
pattern involves (*MARK). However, unless (*MARK) is used in conjunction with |
pattern involves (*MARK). However, unless (*MARK) is used in conjunction with |
| 2614 |
(*COMMIT), this is unlikely to happen for an unanchored pattern because, as the |
(*COMMIT), this is unlikely to happen for an unanchored pattern because, as the |
| 2726 |
.sp |
.sp |
| 2727 |
(*THEN) or (*THEN:NAME) |
(*THEN) or (*THEN:NAME) |
| 2728 |
.sp |
.sp |
| 2729 |
This verb causes a skip to the next alternation in the innermost enclosing |
This verb causes a skip to the next alternation in the innermost enclosing |
| 2730 |
group if the rest of the pattern does not match. That is, it cancels pending |
group if the rest of the pattern does not match. That is, it cancels pending |
| 2731 |
backtracking, but only within the current alternation. Its name comes from the |
backtracking, but only within the current alternation. Its name comes from the |
| 2732 |
observation that it can be used for a pattern-based if-then-else block: |
observation that it can be used for a pattern-based if-then-else block: |
| 2741 |
like (*PRUNE). |
like (*PRUNE). |
| 2742 |
. |
. |
| 2743 |
.P |
.P |
| 2744 |
The above verbs provide four different "strengths" of control when subsequent |
The above verbs provide four different "strengths" of control when subsequent |
| 2745 |
matching fails. (*THEN) is the weakest, carrying on the match at the next |
matching fails. (*THEN) is the weakest, carrying on the match at the next |
| 2746 |
alternation. (*PRUNE) comes next, failing the match at the current starting |
alternation. (*PRUNE) comes next, failing the match at the current starting |
| 2747 |
position, but allowing an advance to the next character (for an unanchored |
position, but allowing an advance to the next character (for an unanchored |
| 2780 |
.rs |
.rs |
| 2781 |
.sp |
.sp |
| 2782 |
.nf |
.nf |
| 2783 |
Last updated: 21 November 2010 |
Last updated: 24 July 2011 |
| 2784 |
Copyright (c) 1997-2010 University of Cambridge. |
Copyright (c) 1997-2011 University of Cambridge. |
| 2785 |
.fi |
.fi |