| 633 |
always 1, and the value of the capture_last field is always -1. |
always 1, and the value of the capture_last field is always -1. |
| 634 |
|
|
| 635 |
7. The \C escape sequence, which (in the standard algorithm) matches a |
7. The \C escape sequence, which (in the standard algorithm) matches a |
| 636 |
single byte, even in UTF-8 mode, is not supported because the alterna- |
single byte, even in UTF-8 mode, is not supported in UTF-8 mode, |
| 637 |
tive algorithm moves through the subject string one character at a |
because the alternative algorithm moves through the subject string one |
| 638 |
time, for all active paths through the tree. |
character at a time, for all active paths through the tree. |
| 639 |
|
|
| 640 |
8. Except for (*FAIL), the backtracking control verbs such as (*PRUNE) |
8. Except for (*FAIL), the backtracking control verbs such as (*PRUNE) |
| 641 |
are not supported. (*FAIL) is supported, and behaves like a failing |
are not supported. (*FAIL) is supported, and behaves like a failing |
| 685 |
|
|
| 686 |
REVISION |
REVISION |
| 687 |
|
|
| 688 |
Last updated: 17 November 2010 |
Last updated: 19 November 2011 |
| 689 |
Copyright (c) 1997-2010 University of Cambridge. |
Copyright (c) 1997-2010 University of Cambridge. |
| 690 |
------------------------------------------------------------------------------ |
------------------------------------------------------------------------------ |
| 691 |
|
|
| 1256 |
set (assuming it can find an "a" in the subject), whereas it fails by |
set (assuming it can find an "a" in the subject), whereas it fails by |
| 1257 |
default, for Perl compatibility. |
default, for Perl compatibility. |
| 1258 |
|
|
| 1259 |
|
(3) \U matches an upper case "U" character; by default \U causes a com- |
| 1260 |
|
pile time error (Perl uses \U to upper case subsequent characters). |
| 1261 |
|
|
| 1262 |
|
(4) \u matches a lower case "u" character unless it is followed by four |
| 1263 |
|
hexadecimal digits, in which case the hexadecimal number defines the |
| 1264 |
|
code point to match. By default, \u causes a compile time error (Perl |
| 1265 |
|
uses it to upper case the following character). |
| 1266 |
|
|
| 1267 |
|
(5) \x matches a lower case "x" character unless it is followed by two |
| 1268 |
|
hexadecimal digits, in which case the hexadecimal number defines the |
| 1269 |
|
code point to match. By default, as in Perl, a hexadecimal number is |
| 1270 |
|
always expected after \x, but it may have zero, one, or two digits (so, |
| 1271 |
|
for example, \xz matches a binary zero character followed by z). |
| 1272 |
|
|
| 1273 |
PCRE_MULTILINE |
PCRE_MULTILINE |
| 1274 |
|
|
| 1275 |
By default, PCRE treats the subject string as consisting of a single |
By default, PCRE treats the subject string as consisting of a single |
| 1724 |
compiler could not handle this particular pattern. See the pcrejit doc- |
compiler could not handle this particular pattern. See the pcrejit doc- |
| 1725 |
umentation for details of what can and cannot be handled. |
umentation for details of what can and cannot be handled. |
| 1726 |
|
|
| 1727 |
|
PCRE_INFO_JITSIZE |
| 1728 |
|
|
| 1729 |
|
If the pattern was successfully studied with the PCRE_STUDY_JIT_COMPILE |
| 1730 |
|
option, return the size of the JIT compiled code, otherwise return |
| 1731 |
|
zero. The fourth argument should point to a size_t variable. |
| 1732 |
|
|
| 1733 |
PCRE_INFO_LASTLITERAL |
PCRE_INFO_LASTLITERAL |
| 1734 |
|
|
| 1735 |
Return the value of the rightmost literal byte that must exist in any |
Return the value of the rightmost literal byte that must exist in any |
| 1838 |
|
|
| 1839 |
PCRE_INFO_SIZE |
PCRE_INFO_SIZE |
| 1840 |
|
|
| 1841 |
Return the size of the compiled pattern, that is, the value that was |
Return the size of the compiled pattern. The fourth argument should |
| 1842 |
passed as the argument to pcre_malloc() when PCRE was getting memory in |
point to a size_t variable. This value does not include the size of the |
| 1843 |
which to place the compiled data. The fourth argument should point to a |
pcre structure that is returned by pcre_compile(). The value that is |
| 1844 |
size_t variable. |
passed as the argument to pcre_malloc() when pcre_compile() is getting |
| 1845 |
|
memory in which to place the compiled data is the value returned by |
| 1846 |
|
this option plus the size of the pcre structure. Studying a compiled |
| 1847 |
|
pattern, with or without JIT, does not alter the value returned by this |
| 1848 |
|
option. |
| 1849 |
|
|
| 1850 |
PCRE_INFO_STUDYSIZE |
PCRE_INFO_STUDYSIZE |
| 1851 |
|
|
| 3004 |
|
|
| 3005 |
REVISION |
REVISION |
| 3006 |
|
|
| 3007 |
Last updated: 23 September 2011 |
Last updated: 02 December 2011 |
| 3008 |
Copyright (c) 1997-2011 University of Cambridge. |
Copyright (c) 1997-2011 University of Cambridge. |
| 3009 |
------------------------------------------------------------------------------ |
------------------------------------------------------------------------------ |
| 3010 |
|
|
| 3167 |
|
|
| 3168 |
The mark field is present from version 2 of the pcre_callout structure. |
The mark field is present from version 2 of the pcre_callout structure. |
| 3169 |
In callouts from pcre_exec() it contains a pointer to the zero-termi- |
In callouts from pcre_exec() it contains a pointer to the zero-termi- |
| 3170 |
nated name of the most recently passed (*MARK) item in the match, or |
nated name of the most recently passed (*MARK), (*PRUNE), or (*THEN) |
| 3171 |
NULL if there are no (*MARK)s in the current matching path. In callouts |
item in the match, or NULL if no such items have been passed. Instances |
| 3172 |
from pcre_dfa_exec() this field always contains NULL. |
of (*PRUNE) or (*THEN) without a name do not obliterate a previous |
| 3173 |
|
(*MARK). In callouts from pcre_dfa_exec() this field always contains |
| 3174 |
|
NULL. |
| 3175 |
|
|
| 3176 |
|
|
| 3177 |
RETURN VALUES |
RETURN VALUES |
| 3199 |
|
|
| 3200 |
REVISION |
REVISION |
| 3201 |
|
|
| 3202 |
Last updated: 26 August 2011 |
Last updated: 30 November 2011 |
| 3203 |
Copyright (c) 1997-2011 University of Cambridge. |
Copyright (c) 1997-2011 University of Cambridge. |
| 3204 |
------------------------------------------------------------------------------ |
------------------------------------------------------------------------------ |
| 3205 |
|
|
| 3244 |
its own, matching a non-newline character, is supported.) In fact these |
its own, matching a non-newline character, is supported.) In fact these |
| 3245 |
are implemented by Perl's general string-handling and are not part of |
are implemented by Perl's general string-handling and are not part of |
| 3246 |
its pattern matching engine. If any of these are encountered by PCRE, |
its pattern matching engine. If any of these are encountered by PCRE, |
| 3247 |
an error is generated. |
an error is generated by default. However, if the PCRE_JAVASCRIPT_COM- |
| 3248 |
|
PAT option is set, \U and \u are interpreted as JavaScript interprets |
| 3249 |
|
them. |
| 3250 |
|
|
| 3251 |
6. The Perl escape sequences \p, \P, and \X are supported only if PCRE |
6. The Perl escape sequences \p, \P, and \X are supported only if PCRE |
| 3252 |
is built with Unicode character property support. The properties that |
is built with Unicode character property support. The properties that |
| 3373 |
|
|
| 3374 |
REVISION |
REVISION |
| 3375 |
|
|
| 3376 |
Last updated: 09 October 2011 |
Last updated: 14 November 2011 |
| 3377 |
Copyright (c) 1997-2011 University of Cambridge. |
Copyright (c) 1997-2011 University of Cambridge. |
| 3378 |
------------------------------------------------------------------------------ |
------------------------------------------------------------------------------ |
| 3379 |
|
|
| 3600 |
\t tab (hex 09) |
\t tab (hex 09) |
| 3601 |
\ddd character with octal code ddd, or back reference |
\ddd character with octal code ddd, or back reference |
| 3602 |
\xhh character with hex code hh |
\xhh character with hex code hh |
| 3603 |
\x{hhh..} character with hex code hhh.. |
\x{hhh..} character with hex code hhh.. (non-JavaScript mode) |
| 3604 |
|
\uhhhh character with hex code hhhh (JavaScript mode only) |
| 3605 |
|
|
| 3606 |
The precise effect of \cx is as follows: if x is a lower case letter, |
The precise effect of \cx is as follows: if x is a lower case letter, |
| 3607 |
it is converted to upper case. Then bit 6 of the character (hex 40) is |
it is converted to upper case. Then bit 6 of the character (hex 40) is |
| 3612 |
is compiled in EBCDIC mode, all byte values are valid. A lower case |
is compiled in EBCDIC mode, all byte values are valid. A lower case |
| 3613 |
letter is converted to upper case, and then the 0xc0 bits are flipped.) |
letter is converted to upper case, and then the 0xc0 bits are flipped.) |
| 3614 |
|
|
| 3615 |
After \x, from zero to two hexadecimal digits are read (letters can be |
By default, after \x, from zero to two hexadecimal digits are read |
| 3616 |
in upper or lower case). Any number of hexadecimal digits may appear |
(letters can be in upper or lower case). Any number of hexadecimal dig- |
| 3617 |
between \x{ and }, but the value of the character code must be less |
its may appear between \x{ and }, but the value of the character code |
| 3618 |
than 256 in non-UTF-8 mode, and less than 2**31 in UTF-8 mode. That is, |
must be less than 256 in non-UTF-8 mode, and less than 2**31 in UTF-8 |
| 3619 |
the maximum value in hexadecimal is 7FFFFFFF. Note that this is bigger |
mode. That is, the maximum value in hexadecimal is 7FFFFFFF. Note that |
| 3620 |
than the largest Unicode code point, which is 10FFFF. |
this is bigger than the largest Unicode code point, which is 10FFFF. |
| 3621 |
|
|
| 3622 |
If characters other than hexadecimal digits appear between \x{ and }, |
If characters other than hexadecimal digits appear between \x{ and }, |
| 3623 |
or if there is no terminating }, this form of escape is not recognized. |
or if there is no terminating }, this form of escape is not recognized. |
| 3625 |
escape, with no following digits, giving a character whose value is |
escape, with no following digits, giving a character whose value is |
| 3626 |
zero. |
zero. |
| 3627 |
|
|
| 3628 |
|
If the PCRE_JAVASCRIPT_COMPAT option is set, the interpretation of \x |
| 3629 |
|
is as just described only when it is followed by two hexadecimal dig- |
| 3630 |
|
its. Otherwise, it matches a literal "x" character. In JavaScript |
| 3631 |
|
mode, support for code points greater than 256 is provided by \u, which |
| 3632 |
|
must be followed by four hexadecimal digits; otherwise it matches a |
| 3633 |
|
literal "u" character. |
| 3634 |
|
|
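The two readings of \x described above can be sketched in C. This is only an illustration of the documented rules, not PCRE's own code: the names parse_x_escape and hex_value and the js_compat flag (standing in for PCRE_JAVASCRIPT_COMPAT) are hypothetical, and the \x{...} and \u forms are deliberately left out.

```c
#include <stddef.h>

/* Hypothetical helper: value of one hexadecimal digit, or -1 if c is not one. */
static int hex_value(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/*
 * Hypothetical sketch. p points just past "\x" in a NUL-terminated pattern.
 * Stores the character code in *value and returns how many characters after
 * "\x" were consumed. js_compat != 0 models PCRE_JAVASCRIPT_COMPAT.
 */
size_t parse_x_escape(const char *p, int js_compat, unsigned *value)
{
    if (js_compat) {
        /* JavaScript mode: exactly two hex digits, otherwise a literal "x". */
        int hi = hex_value(p[0]);
        int lo = (hi >= 0) ? hex_value(p[1]) : -1;
        if (hi >= 0 && lo >= 0) {
            *value = (unsigned)(hi * 16 + lo);
            return 2;
        }
        *value = 'x';
        return 0;
    }

    /* Default mode: read zero, one, or two hex digits. */
    unsigned v = 0;
    size_t n = 0;
    while (n < 2 && hex_value(p[n]) >= 0) {
        v = v * 16 + (unsigned)hex_value(p[n]);
        n++;
    }
    *value = v;
    return n;
}
```

Under these assumptions, parse_x_escape("z", 0, &v) consumes nothing and yields the value zero (so \xz is a binary zero followed by "z"), whereas with js_compat set the same input yields a literal "x".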
| 3635 |
Characters whose value is less than 256 can be defined by either of the |
Characters whose value is less than 256 can be defined by either of the |
| 3636 |
two syntaxes for \x. There is no difference in the way they are han- |
two syntaxes for \x (or by \u in JavaScript mode). There is no differ- |
| 3637 |
dled. For example, \xdc is exactly the same as \x{dc}. |
ence in the way they are handled. For example, \xdc is exactly the same |
| 3638 |
|
as \x{dc} (or \u00dc in JavaScript mode). |
| 3639 |
|
|
| 3640 |
After \0 up to two further octal digits are read. If there are fewer |
After \0 up to two further octal digits are read. If there are fewer |
| 3641 |
than two digits, just those that are present are used. Thus the |
than two digits, just those that are present are used. Thus the |
| 3679 |
|
|
| 3680 |
All the sequences that define a single character value can be used both |
All the sequences that define a single character value can be used both |
| 3681 |
inside and outside character classes. In addition, inside a character |
inside and outside character classes. In addition, inside a character |
| 3682 |
class, the sequence \b is interpreted as the backspace character (hex |
class, \b is interpreted as the backspace character (hex 08). |
| 3683 |
08). The sequences \B, \N, \R, and \X are not special inside a charac- |
|
| 3684 |
ter class. Like any other unrecognized escape sequences, they are |
\N is not allowed in a character class. \B, \R, and \X are not special |
| 3685 |
treated as the literal characters "B", "N", "R", and "X" by default, |
inside a character class. Like other unrecognized escape sequences, |
| 3686 |
but cause an error if the PCRE_EXTRA option is set. Outside a character |
they are treated as the literal characters "B", "R", and "X" by |
| 3687 |
class, these sequences have different meanings. |
default, but cause an error if the PCRE_EXTRA option is set. Outside a |
| 3688 |
|
character class, these sequences have different meanings. |
| 3689 |
|
|
| 3690 |
|
Unsupported escape sequences |
| 3691 |
|
|
| 3692 |
|
In Perl, the sequences \l, \L, \u, and \U are recognized by its string |
| 3693 |
|
handler and used to modify the case of following characters. By |
| 3694 |
|
default, PCRE does not support these escape sequences. However, if the |
| 3695 |
|
PCRE_JAVASCRIPT_COMPAT option is set, \U matches a "U" character, and |
| 3696 |
|
\u can be used to define a character by code point, as described in the |
| 3697 |
|
previous section. |
| 3698 |
|
|
| 3699 |
Absolute and relative back references |
Absolute and relative back references |
| 3700 |
|
|
| 3729 |
|
|
| 3730 |
There is also the single sequence \N, which matches a non-newline char- |
There is also the single sequence \N, which matches a non-newline char- |
| 3731 |
acter. This is the same as the "." metacharacter when PCRE_DOTALL is |
acter. This is the same as the "." metacharacter when PCRE_DOTALL is |
| 3732 |
not set. |
not set. Perl also uses \N to match characters by name; PCRE does not |
| 3733 |
|
support this. |
| 3734 |
|
|
| 3735 |
Each pair of lower and upper case escape sequences partitions the com- |
Each pair of lower and upper case escape sequences partitions the com- |
| 3736 |
plete set of characters into two disjoint sets. Any given character |
plete set of characters into two disjoint sets. Any given character |
| 3737 |
matches one, and only one, of each pair. The sequences can appear both |
matches one, and only one, of each pair. The sequences can appear both |
| 3738 |
inside and outside character classes. They each match one character of |
inside and outside character classes. They each match one character of |
| 3739 |
the appropriate type. If the current matching point is at the end of |
the appropriate type. If the current matching point is at the end of |
| 3740 |
the subject string, all of them fail, because there is no character to |
the subject string, all of them fail, because there is no character to |
| 3741 |
match. |
match. |
| 3742 |
|
|
| 3743 |
For compatibility with Perl, \s does not match the VT character (code |
For compatibility with Perl, \s does not match the VT character (code |
| 3744 |
11). This makes it different from the POSIX "space" class. The \s |
11). This makes it different from the POSIX "space" class. The \s |
| 3745 |
characters are HT (9), LF (10), FF (12), CR (13), and space (32). If |
characters are HT (9), LF (10), FF (12), CR (13), and space (32). If |
| 3746 |
"use locale;" is included in a Perl script, \s may match the VT charac- |
"use locale;" is included in a Perl script, \s may match the VT charac- |
| 3747 |
ter. In PCRE, it never does. |
ter. In PCRE, it never does. |
| 3748 |
|
|
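As a rough illustration of the \s set listed above, the following C sketch compares it with the standard isspace() function. The function names are hypothetical and are not part of the PCRE API; the comparison assumes the default "C" locale and says nothing about PCRE_UCP.

```c
#include <ctype.h>

/* Hypothetical sketch of the default \s set: HT, LF, FF, CR, and space. */
int pcre_default_space(unsigned char c)
{
    return c == 9 || c == 10 || c == 12 || c == 13 || c == 32;
}

/* Nonzero when the sketch above and isspace() disagree about c. */
int differs_from_isspace(unsigned char c)
{
    return (isspace(c) != 0) != pcre_default_space(c);
}
```

In the "C" locale the only byte on which the two disagree should be VT (code 11), which isspace() accepts and \s does not.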
| 3749 |
A "word" character is an underscore or any character that is a letter |
A "word" character is an underscore or any character that is a letter |
| 3750 |
or digit. By default, the definition of letters and digits is con- |
or digit. By default, the definition of letters and digits is con- |
| 3751 |
trolled by PCRE's low-valued character tables, and may vary if locale- |
trolled by PCRE's low-valued character tables, and may vary if locale- |
| 3752 |
specific matching is taking place (see "Locale support" in the pcreapi |
specific matching is taking place (see "Locale support" in the pcreapi |
| 3753 |
page). For example, in a French locale such as "fr_FR" in Unix-like |
page). For example, in a French locale such as "fr_FR" in Unix-like |
| 3754 |
systems, or "french" in Windows, some character codes greater than 128 |
systems, or "french" in Windows, some character codes greater than 128 |
| 3755 |
are used for accented letters, and these are then matched by \w. The |
are used for accented letters, and these are then matched by \w. The |
| 3756 |
use of locales with Unicode is discouraged. |
use of locales with Unicode is discouraged. |
| 3757 |
|
|
| 3758 |
By default, in UTF-8 mode, characters with values greater than 128 |
By default, in UTF-8 mode, characters with values greater than 128 |
| 3759 |
never match \d, \s, or \w, and always match \D, \S, and \W. These |
never match \d, \s, or \w, and always match \D, \S, and \W. These |
| 3760 |
sequences retain their original meanings from before UTF-8 support was |
sequences retain their original meanings from before UTF-8 support was |
| 3761 |
available, mainly for efficiency reasons. However, if PCRE is compiled |
available, mainly for efficiency reasons. However, if PCRE is compiled |
| 3762 |
with Unicode property support, and the PCRE_UCP option is set, the be- |
with Unicode property support, and the PCRE_UCP option is set, the be- |
| 3763 |
haviour is changed so that Unicode properties are used to determine |
haviour is changed so that Unicode properties are used to determine |
| 3764 |
character types, as follows: |
character types, as follows: |
| 3765 |
|
|
| 3766 |
\d any character that \p{Nd} matches (decimal digit) |
\d any character that \p{Nd} matches (decimal digit) |
| 3767 |
\s any character that \p{Z} matches, plus HT, LF, FF, CR |
\s any character that \p{Z} matches, plus HT, LF, FF, CR |
| 3768 |
\w any character that \p{L} or \p{N} matches, plus underscore |
\w any character that \p{L} or \p{N} matches, plus underscore |
| 3769 |
|
|
| 3770 |
The upper case escapes match the inverse sets of characters. Note that |
The upper case escapes match the inverse sets of characters. Note that |
| 3771 |
\d matches only decimal digits, whereas \w matches any Unicode digit, |
\d matches only decimal digits, whereas \w matches any Unicode digit, |
| 3772 |
as well as any Unicode letter, and underscore. Note also that PCRE_UCP |
as well as any Unicode letter, and underscore. Note also that PCRE_UCP |
| 3773 |
affects \b, and \B because they are defined in terms of \w and \W. |
affects \b, and \B because they are defined in terms of \w and \W. |
| 3774 |
Matching these sequences is noticeably slower when PCRE_UCP is set. |
Matching these sequences is noticeably slower when PCRE_UCP is set. |
| 3775 |
|
|
| 3776 |
The sequences \h, \H, \v, and \V are features that were added to Perl |
The sequences \h, \H, \v, and \V are features that were added to Perl |
| 3777 |
at release 5.10. In contrast to the other sequences, which match only |
at release 5.10. In contrast to the other sequences, which match only |
| 3778 |
ASCII characters by default, these always match certain high-valued |
ASCII characters by default, these always match certain high-valued |
| 3779 |
codepoints in UTF-8 mode, whether or not PCRE_UCP is set. The horizon- |
codepoints in UTF-8 mode, whether or not PCRE_UCP is set. The horizon- |
| 3780 |
tal space characters are: |
tal space characters are: |
| 3781 |
|
|
| 3782 |
U+0009 Horizontal tab |
U+0009 Horizontal tab |
| 3811 |
|
|
| 3812 |
Newline sequences |
Newline sequences |
| 3813 |
|
|
| 3814 |
Outside a character class, by default, the escape sequence \R matches |
Outside a character class, by default, the escape sequence \R matches |
| 3815 |
any Unicode newline sequence. In non-UTF-8 mode \R is equivalent to the |
any Unicode newline sequence. In non-UTF-8 mode \R is equivalent to the |
| 3816 |
following: |
following: |
| 3817 |
|
|
| 3818 |
(?>\r\n|\n|\x0b|\f|\r|\x85) |
(?>\r\n|\n|\x0b|\f|\r|\x85) |
| 3819 |
|
|
| 3820 |
This is an example of an "atomic group", details of which are given |
This is an example of an "atomic group", details of which are given |
| 3821 |
below. This particular group matches either the two-character sequence |
below. This particular group matches either the two-character sequence |
| 3822 |
CR followed by LF, or one of the single characters LF (linefeed, |
CR followed by LF, or one of the single characters LF (linefeed, |
| 3823 |
U+000A), VT (vertical tab, U+000B), FF (formfeed, U+000C), CR (carriage |
U+000A), VT (vertical tab, U+000B), FF (formfeed, U+000C), CR (carriage |
| 3824 |
return, U+000D), or NEL (next line, U+0085). The two-character sequence |
return, U+000D), or NEL (next line, U+0085). The two-character sequence |
| 3825 |
is treated as a single unit that cannot be split. |
is treated as a single unit that cannot be split. |
| 3826 |
|
|
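The non-UTF-8 behaviour of \R described above can be sketched as a small C function. This is an illustrative reading of the equivalent atomic group, not PCRE's implementation; the name match_newline_seq is hypothetical, and the UTF-8 additions (LS and PS) are not covered.

```c
#include <stddef.h>

/*
 * Hypothetical sketch: length of the newline sequence that starts at s[pos]
 * in a non-UTF-8 subject of len bytes, or 0 if none starts there.
 * CR followed by LF is always taken as one two-byte unit, mirroring the
 * atomic group (?>\r\n|\n|\x0b|\f|\r|\x85).
 */
size_t match_newline_seq(const unsigned char *s, size_t len, size_t pos)
{
    if (pos >= len)
        return 0;

    unsigned char c = s[pos];

    if (c == '\r')
        return (pos + 1 < len && s[pos + 1] == '\n') ? 2 : 1;

    /* LF, VT, FF, and NEL are single-byte alternatives. */
    if (c == '\n' || c == 0x0b || c == 0x0c || c == 0x85)
        return 1;

    return 0;
}
```

Because the function never reports a length of 1 for a CR that is followed by LF, a caller cannot split the pair, which is the effect the atomic group has in the pattern.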
| 3827 |
In UTF-8 mode, two additional characters whose codepoints are greater |
In UTF-8 mode, two additional characters whose codepoints are greater |
| 3828 |
than 255 are added: LS (line separator, U+2028) and PS (paragraph sepa- |
than 255 are added: LS (line separator, U+2028) and PS (paragraph sepa- |
| 3829 |
rator, U+2029). Unicode character property support is not needed for |
rator, U+2029). Unicode character property support is not needed for |
| 3830 |
these characters to be recognized. |
these characters to be recognized. |
| 3831 |
|
|
| 3832 |
It is possible to restrict \R to match only CR, LF, or CRLF (instead of |
It is possible to restrict \R to match only CR, LF, or CRLF (instead of |
| 3833 |
the complete set of Unicode line endings) by setting the option |
the complete set of Unicode line endings) by setting the option |
| 3834 |
PCRE_BSR_ANYCRLF either at compile time or when the pattern is matched. |
PCRE_BSR_ANYCRLF either at compile time or when the pattern is matched. |
| 3835 |
(BSR is an abbreviation for "backslash R".) This can be made the default |
(BSR is an abbreviation for "backslash R".) This can be made the default |
| 3836 |
when PCRE is built; if this is the case, the other behaviour can be |
when PCRE is built; if this is the case, the other behaviour can be |
| 3837 |
requested via the PCRE_BSR_UNICODE option. It is also possible to |
requested via the PCRE_BSR_UNICODE option. It is also possible to |
| 3838 |
specify these settings by starting a pattern string with one of the |
specify these settings by starting a pattern string with one of the |
| 3839 |
following sequences: |
following sequences: |
| 3840 |
|
|
| 3841 |
(*BSR_ANYCRLF) CR, LF, or CRLF only |
(*BSR_ANYCRLF) CR, LF, or CRLF only |
| 3842 |
(*BSR_UNICODE) any Unicode newline sequence |
(*BSR_UNICODE) any Unicode newline sequence |
| 3843 |
|
|
| 3844 |
These override the default and the options given to pcre_compile() or |
These override the default and the options given to pcre_compile() or |
| 3845 |
pcre_compile2(), but they can be overridden by options given to |
pcre_compile2(), but they can be overridden by options given to |
| 3846 |
pcre_exec() or pcre_dfa_exec(). Note that these special settings, which |
pcre_exec() or pcre_dfa_exec(). Note that these special settings, which |
| 3847 |
are not Perl-compatible, are recognized only at the very start of a |
are not Perl-compatible, are recognized only at the very start of a |
| 3848 |
pattern, and that they must be in upper case. If more than one of them |
pattern, and that they must be in upper case. If more than one of them |
| 3849 |
is present, the last one is used. They can be combined with a change of |
is present, the last one is used. They can be combined with a change of |
| 3850 |
newline convention; for example, a pattern can start with: |
newline convention; for example, a pattern can start with: |
| 3851 |
|
|
| 3852 |
(*ANY)(*BSR_ANYCRLF) |
(*ANY)(*BSR_ANYCRLF) |
| 3853 |
|
|
| 3854 |
They can also be combined with the (*UTF8) or (*UCP) special sequences. |
They can also be combined with the (*UTF8) or (*UCP) special sequences. |
| 3855 |
Inside a character class, \R is treated as an unrecognized escape |
Inside a character class, \R is treated as an unrecognized escape |
| 3856 |
sequence, and so matches the letter "R" by default, but causes an error |
sequence, and so matches the letter "R" by default, but causes an error |
| 3857 |
if PCRE_EXTRA is set. |
if PCRE_EXTRA is set. |
| 3858 |
|
|
| 3859 |
Unicode character properties |
Unicode character properties |
| 3860 |
|
|
| 3861 |
When PCRE is built with Unicode character property support, three addi- |
When PCRE is built with Unicode character property support, three addi- |
| 3862 |
tional escape sequences that match characters with specific properties |
tional escape sequences that match characters with specific properties |
| 3863 |
are available. When not in UTF-8 mode, these sequences are of course |
are available. When not in UTF-8 mode, these sequences are of course |
| 3864 |
limited to testing characters whose codepoints are less than 256, but |
limited to testing characters whose codepoints are less than 256, but |
| 3865 |
they do work in this mode. The extra escape sequences are: |
they do work in this mode. The extra escape sequences are: |
| 3866 |
|
|
| 3867 |
\p{xx} a character with the xx property |
\p{xx} a character with the xx property |
| 3868 |
\P{xx} a character without the xx property |
\P{xx} a character without the xx property |
| 3869 |
\X an extended Unicode sequence |
\X an extended Unicode sequence |
| 3870 |
|
|
| 3871 |
The property names represented by xx above are limited to the Unicode |
The property names represented by xx above are limited to the Unicode |
| 3872 |
script names, the general category properties, "Any", which matches any |
script names, the general category properties, "Any", which matches any |
| 3873 |
character (including newline), and some special PCRE properties |
character (including newline), and some special PCRE properties |
| 3874 |
(described in the next section). Other Perl properties such as "InMu- |
(described in the next section). Other Perl properties such as "InMu- |
| 3875 |
sicalSymbols" are not currently supported by PCRE. Note that \P{Any} |
sicalSymbols" are not currently supported by PCRE. Note that \P{Any} |
| 3876 |
does not match any characters, so always causes a match failure. |
does not match any characters, so always causes a match failure. |
| 3877 |
|
|
| 3878 |
Sets of Unicode characters are defined as belonging to certain scripts. |
Sets of Unicode characters are defined as belonging to certain scripts. |
| 3879 |
A character from one of these sets can be matched using a script name. |
A character from one of these sets can be matched using a script name. |
| 3880 |
For example: |
For example: |
| 3881 |
|
|
| 3882 |
\p{Greek} |
\p{Greek} |
| 3883 |
\P{Han} |
\P{Han} |
| 3884 |
|
|
| 3885 |
Those that are not part of an identified script are lumped together as |
Those that are not part of an identified script are lumped together as |
| 3886 |
"Common". The current list of scripts is: |
"Common". The current list of scripts is: |
| 3887 |
|
|
| 3888 |
Arabic, Armenian, Avestan, Balinese, Bamum, Bengali, Bopomofo, Braille, |
Arabic, Armenian, Avestan, Balinese, Bamum, Bengali, Bopomofo, Braille, |
| 3889 |
Buginese, Buhid, Canadian_Aboriginal, Carian, Cham, Cherokee, Common, |
Buginese, Buhid, Canadian_Aboriginal, Carian, Cham, Cherokee, Common, |
| 3890 |
Coptic, Cuneiform, Cypriot, Cyrillic, Deseret, Devanagari, Egyp- |
Coptic, Cuneiform, Cypriot, Cyrillic, Deseret, Devanagari, Egyp- |
| 3891 |
tian_Hieroglyphs, Ethiopic, Georgian, Glagolitic, Gothic, Greek, |
tian_Hieroglyphs, Ethiopic, Georgian, Glagolitic, Gothic, Greek, |
| 3892 |
Gujarati, Gurmukhi, Han, Hangul, Hanunoo, Hebrew, Hiragana, Impe- |
Gujarati, Gurmukhi, Han, Hangul, Hanunoo, Hebrew, Hiragana, Impe- |
| 3893 |
rial_Aramaic, Inherited, Inscriptional_Pahlavi, Inscriptional_Parthian, |
rial_Aramaic, Inherited, Inscriptional_Pahlavi, Inscriptional_Parthian, |
| 3894 |
Javanese, Kaithi, Kannada, Katakana, Kayah_Li, Kharoshthi, Khmer, Lao, |
Javanese, Kaithi, Kannada, Katakana, Kayah_Li, Kharoshthi, Khmer, Lao, |
| 3895 |
Latin, Lepcha, Limbu, Linear_B, Lisu, Lycian, Lydian, Malayalam, |
Latin, Lepcha, Limbu, Linear_B, Lisu, Lycian, Lydian, Malayalam, |
| 3896 |
Meetei_Mayek, Mongolian, Myanmar, New_Tai_Lue, Nko, Ogham, Old_Italic, |
Meetei_Mayek, Mongolian, Myanmar, New_Tai_Lue, Nko, Ogham, Old_Italic, |
| 3897 |
Old_Persian, Old_South_Arabian, Old_Turkic, Ol_Chiki, Oriya, Osmanya, |
Old_Persian, Old_South_Arabian, Old_Turkic, Ol_Chiki, Oriya, Osmanya, |
| 3898 |
Phags_Pa, Phoenician, Rejang, Runic, Samaritan, Saurashtra, Shavian, |
Phags_Pa, Phoenician, Rejang, Runic, Samaritan, Saurashtra, Shavian, |
| 3899 |
Sinhala, Sundanese, Syloti_Nagri, Syriac, Tagalog, Tagbanwa, Tai_Le, |
Sinhala, Sundanese, Syloti_Nagri, Syriac, Tagalog, Tagbanwa, Tai_Le, |
| 3900 |
Tai_Tham, Tai_Viet, Tamil, Telugu, Thaana, Thai, Tibetan, Tifinagh, |
Tai_Tham, Tai_Viet, Tamil, Telugu, Thaana, Thai, Tibetan, Tifinagh, |
| 3901 |
Ugaritic, Vai, Yi. |
Ugaritic, Vai, Yi. |
| 3902 |
|
|
| 3903 |
Each character has exactly one Unicode general category property, spec- |
Each character has exactly one Unicode general category property, spec- |
| 3904 |
ified by a two-letter abbreviation. For compatibility with Perl, nega- |
ified by a two-letter abbreviation. For compatibility with Perl, nega- |
| 3905 |
tion can be specified by including a circumflex between the opening |
tion can be specified by including a circumflex between the opening |
| 3906 |
brace and the property name. For example, \p{^Lu} is the same as |
brace and the property name. For example, \p{^Lu} is the same as |
| 3907 |
\P{Lu}. |
\P{Lu}. |
| 3908 |
|
|
| 3909 |
If only one letter is specified with \p or \P, it includes all the gen- |
If only one letter is specified with \p or \P, it includes all the gen- |
| 3910 |
eral category properties that start with that letter. In this case, in |
eral category properties that start with that letter. In this case, in |
| 3911 |
the absence of negation, the curly brackets in the escape sequence are |
the absence of negation, the curly brackets in the escape sequence are |
| 3912 |
optional; these two examples have the same effect: |
optional; these two examples have the same effect: |
| 3913 |
|
|
| 3914 |
\p{L} |
\p{L} |
| 3960 |
Zp Paragraph separator |
Zp Paragraph separator |
| 3961 |
Zs Space separator |
Zs Space separator |
| 3962 |
|
|
| 3963 |
The special property L& is also supported: it matches a character that |
The special property L& is also supported: it matches a character that |
| 3964 |
has the Lu, Ll, or Lt property, in other words, a letter that is not |
has the Lu, Ll, or Lt property, in other words, a letter that is not |
| 3965 |
classified as a modifier or "other". |
classified as a modifier or "other". |
| 3966 |
|
|
| 3967 |
The Cs (Surrogate) property applies only to characters in the range |
The Cs (Surrogate) property applies only to characters in the range |
| 3968 |
U+D800 to U+DFFF. Such characters are not valid in UTF-8 strings (see |
U+D800 to U+DFFF. Such characters are not valid in UTF-8 strings (see |
| 3969 |
RFC 3629) and so cannot be tested by PCRE, unless UTF-8 validity check- |
RFC 3629) and so cannot be tested by PCRE, unless UTF-8 validity check- |
| 3970 |
ing has been turned off (see the discussion of PCRE_NO_UTF8_CHECK in |
ing has been turned off (see the discussion of PCRE_NO_UTF8_CHECK in |
| 3971 |
the pcreapi page). Perl does not support the Cs property. |
the pcreapi page). Perl does not support the Cs property. |
| 3972 |
|
|
| 3973 |
The long synonyms for property names that Perl supports (such as |
The long synonyms for property names that Perl supports (such as |
| 3974 |
\p{Letter}) are not supported by PCRE, nor is it permitted to prefix |
\p{Letter}) are not supported by PCRE, nor is it permitted to prefix |
| 3975 |
any of these properties with "Is". |
any of these properties with "Is". |
| 3976 |
|
|
| 3977 |
No character that is in the Unicode table has the Cn (unassigned) prop- |
No character that is in the Unicode table has the Cn (unassigned) prop- |
| 3978 |
erty. Instead, this property is assumed for any code point that is not |
erty. Instead, this property is assumed for any code point that is not |
| 3979 |
in the Unicode table. |
in the Unicode table. |
| 3980 |
|
|
| 3981 |
Specifying caseless matching does not affect these escape sequences. |
Specifying caseless matching does not affect these escape sequences. |
| 3982 |
For example, \p{Lu} always matches only upper case letters. |
For example, \p{Lu} always matches only upper case letters. |
| 3983 |
|
|
| 3984 |
The \X escape matches any number of Unicode characters that form an |
The \X escape matches any number of Unicode characters that form an |
| 3985 |
extended Unicode sequence. \X is equivalent to |
extended Unicode sequence. \X is equivalent to |
| 3986 |
|
|
| 3987 |
(?>\PM\pM*) |
(?>\PM\pM*) |
| 3988 |
|
|
| 3989 |
That is, it matches a character without the "mark" property, followed |
That is, it matches a character without the "mark" property, followed |
| 3990 |
by zero or more characters with the "mark" property, and treats the |
by zero or more characters with the "mark" property, and treats the |
| 3991 |
sequence as an atomic group (see below). Characters with the "mark" |
sequence as an atomic group (see below). Characters with the "mark" |
| 3992 |
property are typically accents that affect the preceding character. |
property are typically accents that affect the preceding character. |
| 3993 |
None of them have codepoints less than 256, so in non-UTF-8 mode \X |
None of them have codepoints less than 256, so in non-UTF-8 mode \X |
| 3994 |
matches any one character. |
matches any one character. |
| 3995 |
|
|
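The equivalence of \X and (?>\PM\pM*) can be sketched without a regex engine. The following is only an approximation under stated assumptions: it uses Python's unicodedata tables rather than PCRE's, and match_x is a hypothetical helper, not part of any PCRE API.

```python
import unicodedata

def is_mark(ch):
    # The "mark" property groups the general categories Mn, Mc and Me.
    return unicodedata.category(ch).startswith("M")

def match_x(text, pos=0):
    # Returns the end index of an \X-style match starting at pos, or None.
    if pos >= len(text) or is_mark(text[pos]):
        return None              # \PM must match exactly one non-mark character
    end = pos + 1
    while end < len(text) and is_mark(text[end]):
        end += 1                 # \pM* takes every following mark, never giving any back
    return end

text = "e\u0301a"                # "e" + COMBINING ACUTE ACCENT, then "a"
first_end = match_x(text, 0)
second_end = match_x(text, first_end)
```

Here the first match covers the base letter together with its accent, and a match cannot start on the accent itself.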
| 3996 |
Note that recent versions of Perl have changed \X to match what Unicode |
Note that recent versions of Perl have changed \X to match what Unicode |
| 3997 |
calls an "extended grapheme cluster", which has a more complicated def- |
calls an "extended grapheme cluster", which has a more complicated def- |
| 3998 |
inition. |
inition. |
| 3999 |
|
|
| 4000 |
Matching characters by Unicode property is not fast, because PCRE has |
Matching characters by Unicode property is not fast, because PCRE has |
| 4001 |
to search a structure that contains data for over fifteen thousand |
to search a structure that contains data for over fifteen thousand |
| 4002 |
characters. That is why the traditional escape sequences such as \d and |
characters. That is why the traditional escape sequences such as \d and |
| 4003 |
\w do not use Unicode properties in PCRE by default, though you can |
\w do not use Unicode properties in PCRE by default, though you can |
| 4004 |
make them do so by setting the PCRE_UCP option for pcre_compile() or by |
make them do so by setting the PCRE_UCP option for pcre_compile() or by |
| 4005 |
starting the pattern with (*UCP). |
starting the pattern with (*UCP). |
| 4006 |
|
|
| 4007 |
PCRE's additional properties |
PCRE's additional properties |
| 4008 |
|
|
| 4009 |
As well as the standard Unicode properties described in the previous |
As well as the standard Unicode properties described in the previous |
| 4010 |
section, PCRE supports four more that make it possible to convert tra- |
section, PCRE supports four more that make it possible to convert tra- |
| 4011 |
ditional escape sequences such as \w and \s and POSIX character classes |
ditional escape sequences such as \w and \s and POSIX character classes |
| 4012 |
to use Unicode properties. PCRE uses these non-standard, non-Perl prop- |
to use Unicode properties. PCRE uses these non-standard, non-Perl prop- |
| 4013 |
erties internally when PCRE_UCP is set. They are: |
erties internally when PCRE_UCP is set. They are: |
| 4017 |
Xsp Any Perl space character |
Xsp Any Perl space character |
| 4018 |
Xwd Any Perl "word" character |
Xwd Any Perl "word" character |
| 4019 |
|
|
| 4020 |
Xan matches characters that have either the L (letter) or the N (num- |
Xan matches characters that have either the L (letter) or the N (num- |
| 4021 |
ber) property. Xps matches the characters tab, linefeed, vertical tab, |
ber) property. Xps matches the characters tab, linefeed, vertical tab, |
| 4022 |
formfeed, or carriage return, and any other character that has the Z |
formfeed, or carriage return, and any other character that has the Z |
| 4023 |
(separator) property. Xsp is the same as Xps, except that vertical tab |
(separator) property. Xsp is the same as Xps, except that vertical tab |
| 4024 |
is excluded. Xwd matches the same characters as Xan, plus underscore. |
is excluded. Xwd matches the same characters as Xan, plus underscore. |
| 4025 |
|
|
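The four definitions above translate almost word for word into predicates on a character's general category. This is a sketch only: it relies on Python's unicodedata tables, which may differ in version from PCRE's, and the function names simply mirror the property names.

```python
import unicodedata

def xan(ch):
    # L (letter) or N (number) property
    return unicodedata.category(ch)[0] in "LN"

def xps(ch):
    # tab, linefeed, vertical tab, formfeed, carriage return, or any Z (separator)
    return ch in "\t\n\v\f\r" or unicodedata.category(ch)[0] == "Z"

def xsp(ch):
    # the same as Xps, except that vertical tab is excluded
    return ch != "\v" and xps(ch)

def xwd(ch):
    # the same characters as Xan, plus underscore
    return xan(ch) or ch == "_"

checks = {
    "xan_e_acute": xan("\u00e9"),
    "xan_arabic_digit": xan("\u0663"),
    "xan_underscore": xan("_"),
    "xwd_underscore": xwd("_"),
    "xps_vtab": xps("\v"),
    "xsp_vtab": xsp("\v"),
    "xsp_nbsp": xsp("\u00a0"),
}
```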
| 4026 |
Resetting the match start |
Resetting the match start |
| 4027 |
|
|
| 4028 |
The escape sequence \K causes any previously matched characters not to |
The escape sequence \K causes any previously matched characters not to |
| 4029 |
be included in the final matched sequence. For example, the pattern: |
be included in the final matched sequence. For example, the pattern: |
| 4030 |
|
|
| 4031 |
foo\Kbar |
foo\Kbar |
| 4032 |
|
|
| 4033 |
matches "foobar", but reports that it has matched "bar". This feature |
matches "foobar", but reports that it has matched "bar". This feature |
| 4034 |
is similar to a lookbehind assertion (described below). However, in |
is similar to a lookbehind assertion (described below). However, in |
| 4035 |
this case, the part of the subject before the real match does not have |
this case, the part of the subject before the real match does not have |
| 4036 |
to be of fixed length, as lookbehind assertions do. The use of \K does |
to be of fixed length, as lookbehind assertions do. The use of \K does |
| 4037 |
not interfere with the setting of captured substrings. For example, |
not interfere with the setting of captured substrings. For example, |
| 4038 |
when the pattern |
when the pattern |
| 4039 |
|
|
| 4040 |
(foo)\Kbar |
(foo)\Kbar |
| 4041 |
|
|
| 4042 |
matches "foobar", the first substring is still set to "foo". |
matches "foobar", the first substring is still set to "foo". |
| 4043 |
|
|
| 4044 |
Perl documents that the use of \K within assertions is "not well |
Perl documents that the use of \K within assertions is "not well |
| 4045 |
defined". In PCRE, \K is acted upon when it occurs inside positive |
defined". In PCRE, \K is acted upon when it occurs inside positive |
| 4046 |
assertions, but is ignored in negative assertions. |
assertions, but is ignored in negative assertions. |
| 4047 |
|
|
| 4048 |
Simple assertions |
Simple assertions |
| 4049 |
|
|
| 4050 |
The final use of backslash is for certain simple assertions. An asser- |
The final use of backslash is for certain simple assertions. An asser- |
| 4051 |
tion specifies a condition that has to be met at a particular point in |
tion specifies a condition that has to be met at a particular point in |
| 4052 |
a match, without consuming any characters from the subject string. The |
a match, without consuming any characters from the subject string. The |
| 4053 |
use of subpatterns for more complicated assertions is described below. |
use of subpatterns for more complicated assertions is described below. |
| 4054 |
The backslashed assertions are: |
The backslashed assertions are: |
| 4055 |
|
|
| 4056 |
\b matches at a word boundary |
\b matches at a word boundary |
| 4061 |
\z matches only at the end of the subject |
\z matches only at the end of the subject |
| 4062 |
\G matches at the first matching position in the subject |
\G matches at the first matching position in the subject |
| 4063 |
|
|
| 4064 |
Inside a character class, \b has a different meaning; it matches the |
Inside a character class, \b has a different meaning; it matches the |
| 4065 |
backspace character. If any other of these assertions appears in a |
backspace character. If any other of these assertions appears in a |
| 4066 |
character class, by default it matches the corresponding literal char- |
character class, by default it matches the corresponding literal char- |
| 4067 |
acter (for example, \B matches the letter B). However, if the |
acter (for example, \B matches the letter B). However, if the |
| 4068 |
PCRE_EXTRA option is set, an "invalid escape sequence" error is gener- |
PCRE_EXTRA option is set, an "invalid escape sequence" error is gener- |
| 4069 |
ated instead. |
ated instead. |
| 4070 |
|
|
| 4071 |
A word boundary is a position in the subject string where the current |
A word boundary is a position in the subject string where the current |
| 4072 |
character and the previous character do not both match \w or \W (i.e. |
character and the previous character do not both match \w or \W (i.e. |
| 4073 |
one matches \w and the other matches \W), or the start or end of the |
one matches \w and the other matches \W), or the start or end of the |
| 4074 |
string if the first or last character matches \w, respectively. In |
string if the first or last character matches \w, respectively. In |
| 4075 |
UTF-8 mode, the meanings of \w and \W can be changed by setting the |
UTF-8 mode, the meanings of \w and \W can be changed by setting the |
| 4076 |
PCRE_UCP option. When this is done, it also affects \b and \B. Neither |
PCRE_UCP option. When this is done, it also affects \b and \B. Neither |
| 4077 |
PCRE nor Perl has a separate "start of word" or "end of word" metase- |
PCRE nor Perl has a separate "start of word" or "end of word" metase- |
| 4078 |
quence. However, whatever follows \b normally determines which it is. |
quence. However, whatever follows \b normally determines which it is. |
| 4079 |
For example, the fragment \ba matches "a" at the start of a word. |
For example, the fragment \ba matches "a" at the start of a word. |
| 4080 |
|
|
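The word-boundary rule can be written out directly. The sketch below assumes the default, non-UCP meaning of \w in the C locale (ASCII letters, digits and underscore); is_word and word_boundaries are illustrative names, and Python's re module is used only to cross-check the result.

```python
import re

def is_word(ch):
    # Assumed default \w: ASCII letters, digits and underscore.
    return ch.isascii() and (ch.isalnum() or ch == "_")

def word_boundaries(subject):
    n = len(subject)
    found = []
    for i in range(n + 1):
        # Outside the string counts as "not a word character", which
        # covers the start-of-string and end-of-string cases.
        before = i > 0 and is_word(subject[i - 1])
        after = i < n and is_word(subject[i])
        if before != after:
            found.append(i)
    return found

subject = "a cat_1!"
by_hand = word_boundaries(subject)
by_regex = [m.start() for m in re.finditer(r"\b", subject, re.ASCII)]
```

A position is a boundary exactly when one neighbour is a word character and the other is not.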
| 4081 |
The \A, \Z, and \z assertions differ from the traditional circumflex |
The \A, \Z, and \z assertions differ from the traditional circumflex |
| 4082 |
and dollar (described in the next section) in that they only ever match |
and dollar (described in the next section) in that they only ever match |
| 4083 |
at the very start and end of the subject string, whatever options are |
at the very start and end of the subject string, whatever options are |
| 4084 |
set. Thus, they are independent of multiline mode. These three asser- |
set. Thus, they are independent of multiline mode. These three asser- |
| 4085 |
tions are not affected by the PCRE_NOTBOL or PCRE_NOTEOL options, which |
tions are not affected by the PCRE_NOTBOL or PCRE_NOTEOL options, which |
| 4086 |
affect only the behaviour of the circumflex and dollar metacharacters. |
affect only the behaviour of the circumflex and dollar metacharacters. |
| 4087 |
However, if the startoffset argument of pcre_exec() is non-zero, indi- |
However, if the startoffset argument of pcre_exec() is non-zero, indi- |
| 4088 |
cating that matching is to start at a point other than the beginning of |
cating that matching is to start at a point other than the beginning of |
| 4089 |
the subject, \A can never match. The difference between \Z and \z is |
the subject, \A can never match. The difference between \Z and \z is |
| 4090 |
that \Z matches before a newline at the end of the string as well as at |
that \Z matches before a newline at the end of the string as well as at |
| 4091 |
the very end, whereas \z matches only at the end. |
the very end, whereas \z matches only at the end. |
| 4092 |
|
|
| 4093 |
The \G assertion is true only when the current matching position is at |
The \G assertion is true only when the current matching position is at |
| 4094 |
the start point of the match, as specified by the startoffset argument |
the start point of the match, as specified by the startoffset argument |
| 4095 |
of pcre_exec(). It differs from \A when the value of startoffset is |
of pcre_exec(). It differs from \A when the value of startoffset is |
| 4096 |
non-zero. By calling pcre_exec() multiple times with appropriate argu- |
non-zero. By calling pcre_exec() multiple times with appropriate argu- |
| 4097 |
ments, you can mimic Perl's /g option, and it is in this kind of imple- |
ments, you can mimic Perl's /g option, and it is in this kind of imple- |
| 4098 |
mentation where \G can be useful. |
mentation where \G can be useful. |
| 4099 |
|
|
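The loop that mimics /g can be sketched as follows. Python's re module is used as a stand-in, which is an assumption: it has no \G, but Pattern.match(string, pos) only succeeds at pos, which plays the role that \G plays when pcre_exec() is called with startoffset set to pos. The pattern and the function name are invented for the example.

```python
import re

# Every alternative matches at least one character, so the loop always advances.
TOKEN = re.compile(r"\d+|[a-z]+|\s+")

def tokens(subject):
    pos = 0
    out = []
    while pos < len(subject):
        m = TOKEN.match(subject, pos)   # anchored at pos, like \G with startoffset = pos
        if m is None:
            break                       # nothing matches exactly here: stop, do not skip ahead
        out.append(m.group())
        pos = m.end()                   # the next call starts where this match ended
    return out, pos

found, stopped_at = tokens("ab 12!x")
```

The scan stops at the "!" rather than resuming at the later "x", which is the behaviour an anchored start point is meant to give.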
| 4100 |
Note, however, that PCRE's interpretation of \G, as the start of the |
Note, however, that PCRE's interpretation of \G, as the start of the |
| 4101 |
current match, is subtly different from Perl's, which defines it as the |
current match, is subtly different from Perl's, which defines it as the |
| 4102 |
end of the previous match. In Perl, these can be different when the |
end of the previous match. In Perl, these can be different when the |
| 4103 |
previously matched string was empty. Because PCRE does just one match |
previously matched string was empty. Because PCRE does just one match |
| 4104 |
at a time, it cannot reproduce this behaviour. |
at a time, it cannot reproduce this behaviour. |
| 4105 |
|
|
| 4106 |
If all the alternatives of a pattern begin with \G, the expression is |
If all the alternatives of a pattern begin with \G, the expression is |
| 4107 |
anchored to the starting match position, and the "anchored" flag is set |
anchored to the starting match position, and the "anchored" flag is set |
| 4108 |
in the compiled regular expression. |
in the compiled regular expression. |
| 4109 |
|
|
| 4111 |
CIRCUMFLEX AND DOLLAR |
CIRCUMFLEX AND DOLLAR |
| 4112 |
|
|
| 4113 |
Outside a character class, in the default matching mode, the circumflex |
Outside a character class, in the default matching mode, the circumflex |
| 4114 |
character is an assertion that is true only if the current matching |
character is an assertion that is true only if the current matching |
| 4115 |
point is at the start of the subject string. If the startoffset argu- |
point is at the start of the subject string. If the startoffset argu- |
| 4116 |
ment of pcre_exec() is non-zero, circumflex can never match if the |
ment of pcre_exec() is non-zero, circumflex can never match if the |
| 4117 |
PCRE_MULTILINE option is unset. Inside a character class, circumflex |
PCRE_MULTILINE option is unset. Inside a character class, circumflex |
| 4118 |
has an entirely different meaning (see below). |
has an entirely different meaning (see below). |
| 4119 |
|
|
| 4120 |
Circumflex need not be the first character of the pattern if a number |
Circumflex need not be the first character of the pattern if a number |
| 4121 |
of alternatives are involved, but it should be the first thing in each |
of alternatives are involved, but it should be the first thing in each |
| 4122 |
alternative in which it appears if the pattern is ever to match that |
alternative in which it appears if the pattern is ever to match that |
| 4123 |
branch. If all possible alternatives start with a circumflex, that is, |
branch. If all possible alternatives start with a circumflex, that is, |
| 4124 |
if the pattern is constrained to match only at the start of the sub- |
if the pattern is constrained to match only at the start of the sub- |
| 4125 |
ject, it is said to be an "anchored" pattern. (There are also other |
ject, it is said to be an "anchored" pattern. (There are also other |
| 4126 |
constructs that can cause a pattern to be anchored.) |
constructs that can cause a pattern to be anchored.) |
| 4127 |
|
|
| 4128 |
A dollar character is an assertion that is true only if the current |
A dollar character is an assertion that is true only if the current |
| 4129 |
matching point is at the end of the subject string, or immediately |
matching point is at the end of the subject string, or immediately |
| 4130 |
before a newline at the end of the string (by default). Dollar need not |
before a newline at the end of the string (by default). Dollar need not |
| 4131 |
be the last character of the pattern if a number of alternatives are |
be the last character of the pattern if a number of alternatives are |
| 4132 |
involved, but it should be the last item in any branch in which it |
involved, but it should be the last item in any branch in which it |
| 4133 |
appears. Dollar has no special meaning in a character class. |
appears. Dollar has no special meaning in a character class. |
| 4134 |
|
|
| 4135 |
The meaning of dollar can be changed so that it matches only at the |
The meaning of dollar can be changed so that it matches only at the |
| 4136 |
very end of the string, by setting the PCRE_DOLLAR_ENDONLY option at |
very end of the string, by setting the PCRE_DOLLAR_ENDONLY option at |
| 4137 |
compile time. This does not affect the \Z assertion. |
compile time. This does not affect the \Z assertion. |
| 4138 |
|
|
| 4139 |
The meanings of the circumflex and dollar characters are changed if the |
The meanings of the circumflex and dollar characters are changed if the |
| 4140 |
PCRE_MULTILINE option is set. When this is the case, a circumflex |
PCRE_MULTILINE option is set. When this is the case, a circumflex |
| 4141 |
matches immediately after internal newlines as well as at the start of |
matches immediately after internal newlines as well as at the start of |
| 4142 |
the subject string. It does not match after a newline that ends the |
the subject string. It does not match after a newline that ends the |
| 4143 |
string. A dollar matches before any newlines in the string, as well as |
string. A dollar matches before any newlines in the string, as well as |
| 4144 |
at the very end, when PCRE_MULTILINE is set. When newline is specified |
at the very end, when PCRE_MULTILINE is set. When newline is specified |
| 4145 |
as the two-character sequence CRLF, isolated CR and LF characters do |
as the two-character sequence CRLF, isolated CR and LF characters do |
| 4146 |
not indicate newlines. |
not indicate newlines. |
| 4147 |
|
|
| 4148 |
For example, the pattern /^abc$/ matches the subject string "def\nabc" |
For example, the pattern /^abc$/ matches the subject string "def\nabc" |
| 4149 |
(where \n represents a newline) in multiline mode, but not otherwise. |
(where \n represents a newline) in multiline mode, but not otherwise. |
| 4150 |
Consequently, patterns that are anchored in single line mode because |
Consequently, patterns that are anchored in single line mode because |
| 4151 |
all branches start with ^ are not anchored in multiline mode, and a |
all branches start with ^ are not anchored in multiline mode, and a |
| 4152 |
match for circumflex is possible when the startoffset argument of |
match for circumflex is possible when the startoffset argument of |
| 4153 |
pcre_exec() is non-zero. The PCRE_DOLLAR_ENDONLY option is ignored if |
pcre_exec() is non-zero. The PCRE_DOLLAR_ENDONLY option is ignored if |
| 4154 |
PCRE_MULTILINE is set. |
PCRE_MULTILINE is set. |
| 4155 |
|
|
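The /^abc$/ example can be tried with Python's re module, whose circumflex and dollar follow the same Perl-style rules for these cases. That is an assumption about a different engine, and re.MULTILINE stands in for PCRE_MULTILINE here.

```python
import re

subject = "def\nabc"

# Default mode: circumflex matches only at the very start of the subject.
single = re.search(r"^abc$", subject)

# Multiline mode: circumflex also matches just after the internal newline.
multi = re.search(r"^abc$", subject, re.MULTILINE)

# Default mode: dollar also matches just before a newline that ends the string.
trailing = re.search(r"abc$", "abc\n")
```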
| 4156 |
Note that the sequences \A, \Z, and \z can be used to match the start |
Note that the sequences \A, \Z, and \z can be used to match the start |
| 4157 |
and end of the subject in both modes, and if all branches of a pattern |
and end of the subject in both modes, and if all branches of a pattern |
| 4158 |
start with \A it is always anchored, whether or not PCRE_MULTILINE is |
start with \A it is always anchored, whether or not PCRE_MULTILINE is |
| 4159 |
set. |
set. |
| 4160 |
|
|
| 4161 |
|
|
| 4162 |
FULL STOP (PERIOD, DOT) AND \N |
FULL STOP (PERIOD, DOT) AND \N |
| 4163 |
|
|
| 4164 |
Outside a character class, a dot in the pattern matches any one charac- |
Outside a character class, a dot in the pattern matches any one charac- |
| 4165 |
ter in the subject string except (by default) a character that signi- |
ter in the subject string except (by default) a character that signi- |
| 4166 |
fies the end of a line. In UTF-8 mode, the matched character may be |
fies the end of a line. In UTF-8 mode, the matched character may be |
| 4167 |
more than one byte long. |
more than one byte long. |
| 4168 |
|
|
| 4169 |
When a line ending is defined as a single character, dot never matches |
When a line ending is defined as a single character, dot never matches |
| 4170 |
that character; when the two-character sequence CRLF is used, dot does |
that character; when the two-character sequence CRLF is used, dot does |
| 4171 |
not match CR if it is immediately followed by LF, but otherwise it |
not match CR if it is immediately followed by LF, but otherwise it |
| 4172 |
matches all characters (including isolated CRs and LFs). When any Uni- |
matches all characters (including isolated CRs and LFs). When any Uni- |
| 4173 |
code line endings are being recognized, dot does not match CR or LF or |
code line endings are being recognized, dot does not match CR or LF or |
| 4174 |
any of the other line ending characters. |
any of the other line ending characters. |
| 4175 |
|
|
| 4176 |
The behaviour of dot with regard to newlines can be changed. If the |
The behaviour of dot with regard to newlines can be changed. If the |
| 4177 |
PCRE_DOTALL option is set, a dot matches any one character, without |
PCRE_DOTALL option is set, a dot matches any one character, without |
| 4178 |
exception. If the two-character sequence CRLF is present in the subject |
exception. If the two-character sequence CRLF is present in the subject |
| 4179 |
string, it takes two dots to match it. |
string, it takes two dots to match it. |
| 4180 |
|
|
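A small sketch of the dot behaviour, again using Python's re module as a stand-in. Note the assumptions: Python always treats LF alone as the line ending (there is no equivalent of PCRE's configurable newline convention), and re.DOTALL stands in for PCRE_DOTALL.

```python
import re

# By default a dot does not match the newline character.
plain = re.fullmatch(r"a.b", "a\nb")

# With the "dot matches all" option, it does.
dotall = re.fullmatch(r"a.b", "a\nb", re.DOTALL)

# CRLF is two characters, so it takes two dots to match it.
crlf_one_dot = re.fullmatch(r"a.b", "a\r\nb", re.DOTALL)
crlf_two_dots = re.fullmatch(r"a..b", "a\r\nb", re.DOTALL)
```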
| 4181 |
The handling of dot is entirely independent of the handling of circum- |
The handling of dot is entirely independent of the handling of circum- |
| 4182 |
flex and dollar, the only relationship being that they both involve |
flex and dollar, the only relationship being that they both involve |
| 4183 |
newlines. Dot has no special meaning in a character class. |
newlines. Dot has no special meaning in a character class. |
| 4184 |
|
|
| 4185 |
The escape sequence \N behaves like a dot, except that it is not |
The escape sequence \N behaves like a dot, except that it is not |
| 4186 |
affected by the PCRE_DOTALL option. In other words, it matches any |
affected by the PCRE_DOTALL option. In other words, it matches any |
| 4187 |
character except one that signifies the end of a line. |
character except one that signifies the end of a line. Perl also uses |
| 4188 |
|
\N to match characters by name; PCRE does not support this. |
| 4189 |
|
|
| 4190 |
|
|
| 4191 |
MATCHING A SINGLE BYTE |
MATCHING A SINGLE BYTE |
| 4202 |
PCRE_NO_UTF8_CHECK option is used). |
PCRE_NO_UTF8_CHECK option is used). |
| 4203 |
|
|
| 4204 |
PCRE does not allow \C to appear in lookbehind assertions (described |
PCRE does not allow \C to appear in lookbehind assertions (described |
| 4205 |
below), because in UTF-8 mode this would make it impossible to calcu- |
below) in UTF-8 mode, because this would make it impossible to calcu- |
| 4206 |
late the length of the lookbehind. |
late the length of the lookbehind. |
| 4207 |
|
|
| 4208 |
In general, the \C escape sequence is best avoided in UTF-8 mode. How- |
In general, the \C escape sequence is best avoided in UTF-8 mode. How- |
| 5109 |
then try to match. If there are insufficient characters before the cur- |
then try to match. If there are insufficient characters before the cur- |
| 5110 |
rent position, the assertion fails. |
rent position, the assertion fails. |
| 5111 |
|
|
| 5112 |
PCRE does not allow the \C escape (which matches a single byte in UTF-8 |
In UTF-8 mode, PCRE does not allow the \C escape (which matches a sin- |
| 5113 |
mode) to appear in lookbehind assertions, because it makes it impossi- |
gle byte, even in UTF-8 mode) to appear in lookbehind assertions, |
| 5114 |
ble to calculate the length of the lookbehind. The \X and \R escapes, |
because it makes it impossible to calculate the length of the lookbe- |
| 5115 |
which can match different numbers of bytes, are also not permitted. |
hind. The \X and \R escapes, which can match different numbers of |
| 5116 |
|
bytes, are also not permitted. |
| 5117 |
|
|
| 5118 |
"Subroutine" calls (see below) such as (?2) or (?&X) are permitted in |
"Subroutine" calls (see below) such as (?2) or (?&X) are permitted in |
| 5119 |
lookbehinds, as long as the subpattern matches a fixed-length string. |
lookbehinds, as long as the subpattern matches a fixed-length string. |
| 5120 |
Recursion, however, is not supported. |
Recursion, however, is not supported. |
| 5121 |
|
|
| 5122 |
Possessive quantifiers can be used in conjunction with lookbehind |
Possessive quantifiers can be used in conjunction with lookbehind |
| 5123 |
assertions to specify efficient matching of fixed-length strings at the |
assertions to specify efficient matching of fixed-length strings at the |
| 5124 |
end of subject strings. Consider a simple pattern such as |
end of subject strings. Consider a simple pattern such as |
| 5125 |
|
|
| 5126 |
abcd$ |
abcd$ |
| 5127 |
|
|
| 5128 |
when applied to a long string that does not match. Because matching |
when applied to a long string that does not match. Because matching |
| 5129 |
proceeds from left to right, PCRE will look for each "a" in the subject |
proceeds from left to right, PCRE will look for each "a" in the subject |
| 5130 |
and then see if what follows matches the rest of the pattern. If the |
and then see if what follows matches the rest of the pattern. If the |
| 5131 |
pattern is specified as |
pattern is specified as |
| 5132 |
|
|
| 5133 |
^.*abcd$ |
^.*abcd$ |
| 5134 |
|
|
| 5135 |
the initial .* matches the entire string at first, but when this fails |
the initial .* matches the entire string at first, but when this fails |
| 5136 |
(because there is no following "a"), it backtracks to match all but the |
(because there is no following "a"), it backtracks to match all but the |
| 5137 |
last character, then all but the last two characters, and so on. Once |
last character, then all but the last two characters, and so on. Once |
| 5138 |
again the search for "a" covers the entire string, from right to left, |
again the search for "a" covers the entire string, from right to left, |
| 5139 |
so we are no better off. However, if the pattern is written as |
so we are no better off. However, if the pattern is written as |
| 5140 |
|
|
| 5141 |
^.*+(?<=abcd) |
^.*+(?<=abcd) |
| 5142 |
|
|
| 5143 |
there can be no backtracking for the .*+ item; it can match only the |
there can be no backtracking for the .*+ item; it can match only the |
| 5144 |
entire string. The subsequent lookbehind assertion does a single test |
entire string. The subsequent lookbehind assertion does a single test |
| 5145 |
on the last four characters. If it fails, the match fails immediately. |
on the last four characters. If it fails, the match fails immediately. |
| 5146 |
For long strings, this approach makes a significant difference to the |
For long strings, this approach makes a significant difference to the |
| 5147 |
processing time. |
processing time. |
| 5148 |
|
|
| 5149 |
Using multiple assertions |
Using multiple assertions |
| 5152 |
|
|
| 5153 |
(?<=\d{3})(?<!999)foo |
(?<=\d{3})(?<!999)foo |
| 5154 |
|
|
| 5155 |
matches "foo" preceded by three digits that are not "999". Notice that |
matches "foo" preceded by three digits that are not "999". Notice that |
| 5156 |
each of the assertions is applied independently at the same point in |
each of the assertions is applied independently at the same point in |
| 5157 |
the subject string. First there is a check that the previous three |
the subject string. First there is a check that the previous three |
| 5158 |
characters are all digits, and then there is a check that the same |
characters are all digits, and then there is a check that the same |
| 5159 |
three characters are not "999". This pattern does not match "foo" pre- |
three characters are not "999". This pattern does not match "foo" pre- |
| 5160 |
ceded by six characters, the first three of which are digits and the last |
ceded by six characters, the first three of which are digits and the last |
| 5161 |
three of which are not "999". For example, it doesn't match "123abc- |
three of which are not "999". For example, it doesn't match "123abc- |
| 5162 |
foo". A pattern to do that is |
foo". A pattern to do that is |
| 5163 |
|
|
| 5164 |
(?<=\d{3}...)(?<!999)foo |
(?<=\d{3}...)(?<!999)foo |
| 5165 |
|
|
| 5166 |
This time the first assertion looks at the preceding six characters, |
This time the first assertion looks at the preceding six characters, |
| 5167 |
checking that the first three are digits, and then the second assertion |
checking that the first three are digits, and then the second assertion |
| 5168 |
checks that the preceding three characters are not "999". |
checks that the preceding three characters are not "999". |
| 5169 |
|
|
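Both lookbehind patterns above can be checked with Python's re module, which accepts the same syntax for these fixed-length assertions (an assumption about a different engine, not a statement about PCRE internals).

```python
import re

# Both assertions look at the same three characters before "foo".
first = re.compile(r"(?<=\d{3})(?<!999)foo")

# The first assertion looks back six characters, the second only three.
second = re.compile(r"(?<=\d{3}...)(?<!999)foo")

results = {
    "first/123foo": first.search("123foo") is not None,
    "first/999foo": first.search("999foo") is not None,
    "first/123abcfoo": first.search("123abcfoo") is not None,
    "second/123abcfoo": second.search("123abcfoo") is not None,
    "second/123999foo": second.search("123999foo") is not None,
    "second/123foo": second.search("123foo") is not None,
}
```

The last case fails because there are fewer than six characters before "foo", so the first assertion cannot be satisfied.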
| 5171 |
|
|
| 5172 |
(?<=(?<!foo)bar)baz |
(?<=(?<!foo)bar)baz |
| 5173 |
|
|
| 5174 |
matches an occurrence of "baz" that is preceded by "bar" which in turn |
matches an occurrence of "baz" that is preceded by "bar" which in turn |
| 5175 |
is not preceded by "foo", while |
is not preceded by "foo", while |
| 5176 |
|
|
| 5177 |
(?<=\d{3}(?!999)...)foo |
(?<=\d{3}(?!999)...)foo |
| 5178 |
|
|
| 5179 |
is another pattern that matches "foo" preceded by three digits and any |
is another pattern that matches "foo" preceded by three digits and any |
| 5180 |
three characters that are not "999". |
three characters that are not "999". |
| 5181 |
|
|
| 5182 |
|
|
| 5183 |
CONDITIONAL SUBPATTERNS |
CONDITIONAL SUBPATTERNS |
| 5184 |
|
|
| 5185 |
It is possible to cause the matching process to obey a subpattern con- |
It is possible to cause the matching process to obey a subpattern con- |
| 5186 |
ditionally or to choose between two alternative subpatterns, depending |
ditionally or to choose between two alternative subpatterns, depending |
| 5187 |
on the result of an assertion, or whether a specific capturing subpat- |
on the result of an assertion, or whether a specific capturing subpat- |
| 5188 |
tern has already been matched. The two possible forms of conditional |
tern has already been matched. The two possible forms of conditional |
| 5189 |
subpattern are: |
subpattern are: |
| 5190 |
|
|
| 5191 |
(?(condition)yes-pattern) |
(?(condition)yes-pattern) |
| 5192 |
(?(condition)yes-pattern|no-pattern) |
(?(condition)yes-pattern|no-pattern) |
| 5193 |
|
|
| 5194 |
If the condition is satisfied, the yes-pattern is used; otherwise the |
If the condition is satisfied, the yes-pattern is used; otherwise the |
| 5195 |
no-pattern (if present) is used. If there are more than two alterna- |
no-pattern (if present) is used. If there are more than two alterna- |
| 5196 |
tives in the subpattern, a compile-time error occurs. Each of the two |
tives in the subpattern, a compile-time error occurs. Each of the two |
| 5197 |
alternatives may itself contain nested subpatterns of any form, includ- |
alternatives may itself contain nested subpatterns of any form, includ- |
| 5198 |
ing conditional subpatterns; the restriction to two alternatives |
ing conditional subpatterns; the restriction to two alternatives |
| 5199 |
applies only at the level of the condition. This pattern fragment is an |
applies only at the level of the condition. This pattern fragment is an |
| 5202 |
(?(1) (A|B|C) | (D | (?(2)E|F) | E) ) |
(?(1) (A|B|C) | (D | (?(2)E|F) | E) ) |
| 5203 |
|
|
| 5204 |
|
|
| 5205 |
There are four kinds of condition: references to subpatterns, refer- |
There are four kinds of condition: references to subpatterns, refer- |
| 5206 |
ences to recursion, a pseudo-condition called DEFINE, and assertions. |
ences to recursion, a pseudo-condition called DEFINE, and assertions. |
| 5207 |
|
|
| 5208 |
Checking for a used subpattern by number |
Checking for a used subpattern by number |
| 5209 |
|
|
| 5210 |
If the text between the parentheses consists of a sequence of digits, |
If the text between the parentheses consists of a sequence of digits, |
| 5211 |
the condition is true if a capturing subpattern of that number has pre- |
the condition is true if a capturing subpattern of that number has pre- |
| 5212 |
viously matched. If there is more than one capturing subpattern with |
viously matched. If there is more than one capturing subpattern with |
| 5213 |
the same number (see the earlier section about duplicate subpattern |
the same number (see the earlier section about duplicate subpattern |
| 5214 |
numbers), the condition is true if any of them have matched. An alter- |
numbers), the condition is true if any of them have matched. An alter- |
| 5215 |
native notation is to precede the digits with a plus or minus sign. In |
native notation is to precede the digits with a plus or minus sign. In |
| 5216 |
this case, the subpattern number is relative rather than absolute. The |
this case, the subpattern number is relative rather than absolute. The |
| 5217 |
most recently opened parentheses can be referenced by (?(-1), the next |
most recently opened parentheses can be referenced by (?(-1), the next |
| 5218 |
most recent by (?(-2), and so on. Inside loops it can also make sense |
most recent by (?(-2), and so on. Inside loops it can also make sense |
| 5219 |
to refer to subsequent groups. The next parentheses to be opened can be |
to refer to subsequent groups. The next parentheses to be opened can be |
| 5220 |
referenced as (?(+1), and so on. (The value zero in any of these forms |
referenced as (?(+1), and so on. (The value zero in any of these forms |
| 5221 |
is not used; it provokes a compile-time error.) |
is not used; it provokes a compile-time error.) |
| 5222 |
|
|
| 5223 |
Consider the following pattern, which contains non-significant white |
Consider the following pattern, which contains non-significant white |
| 5224 |
space to make it more readable (assume the PCRE_EXTENDED option) and to |
space to make it more readable (assume the PCRE_EXTENDED option) and to |
| 5225 |
divide it into three parts for ease of discussion: |
divide it into three parts for ease of discussion: |
| 5226 |
|
|
| 5227 |
( \( )? [^()]+ (?(1) \) ) |
( \( )? [^()]+ (?(1) \) ) |
| 5228 |
|
|
| 5229 |
The first part matches an optional opening parenthesis, and if that |
The first part matches an optional opening parenthesis, and if that |
| 5230 |
character is present, sets it as the first captured substring. The sec- |
character is present, sets it as the first captured substring. The sec- |
| 5231 |
ond part matches one or more characters that are not parentheses. The |
ond part matches one or more characters that are not parentheses. The |
| 5232 |
third part is a conditional subpattern that tests whether or not the |
third part is a conditional subpattern that tests whether or not the |
| 5233 |
first set of parentheses matched. If they did, that is, if the subject |
first set of parentheses matched. If they did, that is, if the subject |
| 5234 |
started with an opening parenthesis, the condition is true, and so the |
started with an opening parenthesis, the condition is true, and so the |
| 5235 |
yes-pattern is executed and a closing parenthesis is required. Other- |
yes-pattern is executed and a closing parenthesis is required. Other- |
| 5236 |
wise, since no-pattern is not present, the subpattern matches nothing. |
wise, since no-pattern is not present, the subpattern matches nothing. |
| 5237 |
In other words, this pattern matches a sequence of non-parentheses, |
In other words, this pattern matches a sequence of non-parentheses, |
| 5238 |
optionally enclosed in parentheses. |
optionally enclosed in parentheses. |
| 5239 |
|
|
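The behaviour described above can be tried out with a short sketch. It assumes Python's standard re module, which is not PCRE but accepts the same (?(1)...) conditional syntax, with re.VERBOSE standing in for PCRE_EXTENDED. The names OPTIONAL_PARENS and is_optionally_parenthesized are invented for this illustration.

```python
import re

# Assumption: Python's re module stands in for PCRE here; it accepts the
# same numbered-conditional syntax. re.VERBOSE corresponds to PCRE_EXTENDED.
OPTIONAL_PARENS = re.compile(r"( \( )? [^()]+ (?(1) \) )", re.VERBOSE)


def is_optionally_parenthesized(text):
    """True if all of text is a run of non-parentheses, optionally enclosed in ( )."""
    return OPTIONAL_PARENS.fullmatch(text) is not None
```

A closing parenthesis is demanded only when group 1 took part in the match, so "abc" and "(abc)" are accepted in full while "(abc" and "abc)" are not.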
| 5240 |
If you were embedding this pattern in a larger one, you could use a |
If you were embedding this pattern in a larger one, you could use a |
| 5241 |
relative reference: |
relative reference: |
| 5242 |
|
|
| 5243 |
...other stuff... ( \( )? [^()]+ (?(-1) \) ) ... |
...other stuff... ( \( )? [^()]+ (?(-1) \) ) ... |
| 5244 |
|
|
| 5245 |
This makes the fragment independent of the parentheses in the larger |
This makes the fragment independent of the parentheses in the larger |
| 5246 |
pattern. |
pattern. |
| 5247 |
|
|
| 5248 |
Checking for a used subpattern by name |
Checking for a used subpattern by name |
| 5249 |
|
|
| 5250 |
Perl uses the syntax (?(<name>)...) or (?('name')...) to test for a |
Perl uses the syntax (?(<name>)...) or (?('name')...) to test for a |
| 5251 |
used subpattern by name. For compatibility with earlier versions of |
used subpattern by name. For compatibility with earlier versions of |
| 5252 |
PCRE, which had this facility before Perl, the syntax (?(name)...) is |
PCRE, which had this facility before Perl, the syntax (?(name)...) is |
| 5253 |
also recognized. However, there is a possible ambiguity with this syn- |
also recognized. However, there is a possible ambiguity with this syn- |
| 5254 |
tax, because subpattern names may consist entirely of digits. PCRE |
tax, because subpattern names may consist entirely of digits. PCRE |
| 5255 |
looks first for a named subpattern; if it cannot find one and the name |
looks first for a named subpattern; if it cannot find one and the name |
| 5256 |
consists entirely of digits, PCRE looks for a subpattern of that num- |
consists entirely of digits, PCRE looks for a subpattern of that num- |
| 5257 |
ber, which must be greater than zero. Using subpattern names that con- |
ber, which must be greater than zero. Using subpattern names that con- |
| 5258 |
sist entirely of digits is not recommended. |
sist entirely of digits is not recommended. |
| 5259 |
|
|
| 5260 |
Rewriting the above example to use a named subpattern gives this: |
Rewriting the above example to use a named subpattern gives this: |
| 5261 |
|
|
| 5262 |
(?<OPEN> \( )? [^()]+ (?(<OPEN>) \) ) |
(?<OPEN> \( )? [^()]+ (?(<OPEN>) \) ) |
| 5263 |
|
|
| 5264 |
If the name used in a condition of this kind is a duplicate, the test |
If the name used in a condition of this kind is a duplicate, the test |
| 5265 |
is applied to all subpatterns of the same name, and is true if any one |
is applied to all subpatterns of the same name, and is true if any one |
| 5266 |
of them has matched. |
of them has matched. |
| 5267 |
|
|
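As a rough illustration of the named form, the sketch below again assumes Python's re module rather than PCRE. Python writes the named group as (?P<OPEN>...) and tests it with the bare-name condition (?(OPEN)...), which corresponds to the older (?(name)...) syntax mentioned above. The helper name open_paren_was_used is hypothetical.

```python
import re

# Assumption: Python's re module, not PCRE. Python spells the named group
# (?P<OPEN>...) and tests it with the bare-name form (?(OPEN)...).
NAMED_PARENS = re.compile(r"(?P<OPEN> \( )? [^()]+ (?(OPEN) \) )", re.VERBOSE)


def open_paren_was_used(text):
    """Return True or False for whether OPEN took part in a full match, None if no match."""
    m = NAMED_PARENS.fullmatch(text)
    return None if m is None else m.group("OPEN") is not None
```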
| 5268 |
Checking for pattern recursion |
Checking for pattern recursion |
| 5269 |
|
|
| 5270 |
If the condition is the string (R), and there is no subpattern with the |
If the condition is the string (R), and there is no subpattern with the |
| 5271 |
name R, the condition is true if a recursive call to the whole pattern |
name R, the condition is true if a recursive call to the whole pattern |
| 5272 |
or any subpattern has been made. If digits or a name preceded by amper- |
or any subpattern has been made. If digits or a name preceded by amper- |
| 5273 |
sand follow the letter R, for example: |
sand follow the letter R, for example: |
| 5274 |
|
|
| 5276 |
|
|
| 5277 |
the condition is true if the most recent recursion is into a subpattern |
the condition is true if the most recent recursion is into a subpattern |
| 5278 |
whose number or name is given. This condition does not check the entire |
whose number or name is given. This condition does not check the entire |
| 5279 |
recursion stack. If the name used in a condition of this kind is a |
recursion stack. If the name used in a condition of this kind is a |
| 5280 |
duplicate, the test is applied to all subpatterns of the same name, and |
duplicate, the test is applied to all subpatterns of the same name, and |
| 5281 |
is true if any one of them is the most recent recursion. |
is true if any one of them is the most recent recursion. |
| 5282 |
|
|
| 5283 |
At "top level", all these recursion test conditions are false. The |
At "top level", all these recursion test conditions are false. The |
| 5284 |
syntax for recursive patterns is described below. |
syntax for recursive patterns is described below. |
| 5285 |
|
|
| 5286 |
Defining subpatterns for use by reference only |
Defining subpatterns for use by reference only |
| 5287 |
|
|
| 5288 |
If the condition is the string (DEFINE), and there is no subpattern |
If the condition is the string (DEFINE), and there is no subpattern |
| 5289 |
with the name DEFINE, the condition is always false. In this case, |
with the name DEFINE, the condition is always false. In this case, |
| 5290 |
there may be only one alternative in the subpattern. It is always |
there may be only one alternative in the subpattern. It is always |
| 5291 |
skipped if control reaches this point in the pattern; the idea of |
skipped if control reaches this point in the pattern; the idea of |
| 5292 |
DEFINE is that it can be used to define subroutines that can be refer- |
DEFINE is that it can be used to define subroutines that can be refer- |
| 5293 |
enced from elsewhere. (The use of subroutines is described below.) For |
enced from elsewhere. (The use of subroutines is described below.) For |
| 5294 |
example, a pattern to match an IPv4 address such as "192.168.23.245" |
example, a pattern to match an IPv4 address such as "192.168.23.245" |
| 5295 |
could be written like this (ignore whitespace and line breaks): |
could be written like this (ignore whitespace and line breaks): |
| 5296 |
|
|
| 5297 |
(?(DEFINE) (?<byte> 2[0-4]\d | 25[0-5] | 1\d\d | [1-9]?\d) ) |
(?(DEFINE) (?<byte> 2[0-4]\d | 25[0-5] | 1\d\d | [1-9]?\d) ) |
| 5298 |
\b (?&byte) (\.(?&byte)){3} \b |
\b (?&byte) (\.(?&byte)){3} \b |
| 5299 |
|
|
| 5300 |
The first part of the pattern is a DEFINE group inside which another |
The first part of the pattern is a DEFINE group inside which another |
| 5301 |
group named "byte" is defined. This matches an individual component of |
group named "byte" is defined. This matches an individual component of |
| 5302 |
an IPv4 address (a number less than 256). When matching takes place, |
an IPv4 address (a number less than 256). When matching takes place, |
| 5303 |
this part of the pattern is skipped because DEFINE acts like a false |
this part of the pattern is skipped because DEFINE acts like a false |
| 5304 |
condition. The rest of the pattern uses references to the named group |
condition. The rest of the pattern uses references to the named group |
| 5305 |
to match the four dot-separated components of an IPv4 address, insist- |
to match the four dot-separated components of an IPv4 address, insist- |
| 5306 |
ing on a word boundary at each end. |
ing on a word boundary at each end. |
| 5307 |
|
|
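The idea of defining a piece once and referencing it several times can be sketched in a language without DEFINE. The example below assumes Python's re module, which has neither (?(DEFINE)...) nor (?&name), so the reusable "byte" piece is kept in a string and spliced in where PCRE would make a subroutine call. The names BYTE, IPV4 and find_ipv4 are invented here.

```python
import re

# Assumption: Python's re module has no (?(DEFINE)...) group or (?&name) call,
# so the reusable piece lives in a string and is spliced in at each use.
BYTE = r"(?:2[0-4]\d|25[0-5]|1\d\d|[1-9]?\d)"   # one component, 0 to 255
IPV4 = re.compile(rf"\b{BYTE}(?:\.{BYTE}){{3}}\b", re.ASCII)


def find_ipv4(text):
    """Return the first IPv4-looking substring of text, or None."""
    m = IPV4.search(text)
    return m.group(0) if m else None
```

Unlike a DEFINE group, the interpolated text is repeated in the compiled pattern; the sketch only mirrors the structure of the PCRE example, not its implementation.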
| 5308 |
Assertion conditions |
Assertion conditions |
| 5309 |
|
|
| 5310 |
If the condition is not in any of the above formats, it must be an |
If the condition is not in any of the above formats, it must be an |
| 5311 |
assertion. This may be a positive or negative lookahead or lookbehind |
assertion. This may be a positive or negative lookahead or lookbehind |
| 5312 |
assertion. Consider this pattern, again containing non-significant |
assertion. Consider this pattern, again containing non-significant |
| 5313 |
white space, and with the two alternatives on the second line: |
white space, and with the two alternatives on the second line: |
| 5314 |
|
|
| 5315 |
(?(?=[^a-z]*[a-z]) |
(?(?=[^a-z]*[a-z]) |
| 5316 |
\d{2}-[a-z]{3}-\d{2} | \d{2}-\d{2}-\d{2} ) |
\d{2}-[a-z]{3}-\d{2} | \d{2}-\d{2}-\d{2} ) |
| 5317 |
|
|
| 5318 |
The condition is a positive lookahead assertion that matches an |
The condition is a positive lookahead assertion that matches an |
| 5319 |
optional sequence of non-letters followed by a letter. In other words, |
optional sequence of non-letters followed by a letter. In other words, |
| 5320 |
it tests for the presence of at least one letter in the subject. If a |
it tests for the presence of at least one letter in the subject. If a |
| 5321 |
letter is found, the subject is matched against the first alternative; |
letter is found, the subject is matched against the first alternative; |
| 5322 |
otherwise it is matched against the second. This pattern matches |
otherwise it is matched against the second. This pattern matches |
| 5323 |
strings in one of the two forms dd-aaa-dd or dd-dd-dd, where aaa are |
strings in one of the two forms dd-aaa-dd or dd-dd-dd, where aaa are |
| 5324 |
letters and dd are digits. |
letters and dd are digits. |
| 5325 |
|
|
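One way to see what an assertion condition does is to spell it out without the conditional syntax. The sketch below assumes Python's re module, which does not accept an assertion as a condition; it rewrites (?(?=A)yes|no) as (?:(?=A)yes|(?!A)no), which selects the same alternative. HAS_LETTER and DATE_LIKE are names made up for the example.

```python
import re

# Assumption: Python's re module, which rejects an assertion used as a condition.
# (?(?=A) yes | no) is spelled out as (?: (?=A) yes | (?!A) no ) instead.
HAS_LETTER = r"[^a-z]*[a-z]"
DATE_LIKE = re.compile(
    rf"(?: (?={HAS_LETTER}) \d{{2}}-[a-z]{{3}}-\d{{2}}"
    rf"  | (?!{HAS_LETTER}) \d{{2}}-\d{{2}}-\d{{2}} )",
    re.VERBOSE,
)
```

Because the lookahead scans the rest of the subject, a letter anywhere ahead selects the first alternative: DATE_LIKE.match("12-34-56 x") fails even though the subject starts with dd-dd-dd.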
| 5326 |
|
|
| 5329 |
There are two ways of including comments in patterns that are processed |
There are two ways of including comments in patterns that are processed |
| 5330 |
by PCRE. In both cases, the start of the comment must not be in a char- |
by PCRE. In both cases, the start of the comment must not be in a char- |
| 5331 |
acter class, nor in the middle of any other sequence of related charac- |
acter class, nor in the middle of any other sequence of related charac- |
| 5332 |
ters such as (?: or a subpattern name or number. The characters that |
ters such as (?: or a subpattern name or number. The characters that |
| 5333 |
make up a comment play no part in the pattern matching. |
make up a comment play no part in the pattern matching. |
| 5334 |
|
|
| 5335 |
The sequence (?# marks the start of a comment that continues up to the |
The sequence (?# marks the start of a comment that continues up to the |
| 5336 |
next closing parenthesis. Nested parentheses are not permitted. If the |
next closing parenthesis. Nested parentheses are not permitted. If the |
| 5337 |
PCRE_EXTENDED option is set, an unescaped # character also introduces a |
PCRE_EXTENDED option is set, an unescaped # character also introduces a |
| 5338 |
comment, which in this case continues to immediately after the next |
comment, which in this case continues to immediately after the next |
| 5339 |
newline character or character sequence in the pattern. Which charac- |
newline character or character sequence in the pattern. Which charac- |
| 5340 |
ters are interpreted as newlines is controlled by the options passed to |
ters are interpreted as newlines is controlled by the options passed to |
| 5341 |
pcre_compile() or by a special sequence at the start of the pattern, as |
pcre_compile() or by a special sequence at the start of the pattern, as |
| 5342 |
described in the section entitled "Newline conventions" above. Note |
described in the section entitled "Newline conventions" above. Note |
| 5343 |
that the end of this type of comment is a literal newline sequence in |
that the end of this type of comment is a literal newline sequence in |
| 5344 |
the pattern; escape sequences that happen to represent a newline do not |
the pattern; escape sequences that happen to represent a newline do not |
| 5345 |
count. For example, consider this pattern when PCRE_EXTENDED is set, |
count. For example, consider this pattern when PCRE_EXTENDED is set, |
| 5346 |
and the default newline convention is in force: |
and the default newline convention is in force: |
| 5347 |
|
|
| 5348 |
abc #comment \n still comment |
abc #comment \n still comment |
| 5349 |
|
|
| 5350 |
On encountering the # character, pcre_compile() skips along, looking |
On encountering the # character, pcre_compile() skips along, looking |
| 5351 |
for a newline in the pattern. The sequence \n is still literal at this |
for a newline in the pattern. The sequence \n is still literal at this |
| 5352 |
stage, so it does not terminate the comment. Only an actual character |
stage, so it does not terminate the comment. Only an actual character |
| 5353 |
with the code value 0x0a (the default newline) does so. |
with the code value 0x0a (the default newline) does so. |
| 5354 |
|
|
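The distinction between an escape sequence and a literal newline can be observed in a short sketch. It assumes Python's re module with re.VERBOSE in place of PCRE_EXTENDED; Python happens to treat # comments the same way, but this is an analogy and not a statement about pcre_compile() itself.

```python
import re

# Assumption: Python's re.VERBOSE mode, used here in place of PCRE_EXTENDED.
# In a raw string, \n is a backslash followed by "n", so the comment runs on.
escaped = re.compile(r"abc #comment \n still comment", re.VERBOSE)

# In an ordinary string literal, \n is a real 0x0a character and ends the
# comment, so "def" becomes part of the pattern.
literal = re.compile("abc #comment \n def", re.VERBOSE)

# The (?#...) form needs no option and ends at the next closing parenthesis.
inline = re.compile(r"abc(?#comment)def")
```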
| 5355 |
|
|
| 5356 |
RECURSIVE PATTERNS |
RECURSIVE PATTERNS |
| 5357 |
|
|
| 5358 |
Consider the problem of matching a string in parentheses, allowing for |
Consider the problem of matching a string in parentheses, allowing for |
| 5359 |
unlimited nested parentheses. Without the use of recursion, the best |
unlimited nested parentheses. Without the use of recursion, the best |
| 5360 |
that can be done is to use a pattern that matches up to some fixed |
that can be done is to use a pattern that matches up to some fixed |
| 5361 |
depth of nesting. It is not possible to handle an arbitrary nesting |
depth of nesting. It is not possible to handle an arbitrary nesting |
| 5362 |
depth. |
depth. |
| 5363 |
|
|
| 5364 |
For some time, Perl has provided a facility that allows regular expres- |
For some time, Perl has provided a facility that allows regular expres- |
| 5365 |
sions to recurse (amongst other things). It does this by interpolating |
sions to recurse (amongst other things). It does this by interpolating |
| 5366 |
Perl code in the expression at run time, and the code can refer to the |
Perl code in the expression at run time, and the code can refer to the |
| 5367 |
expression itself. A Perl pattern using code interpolation to solve the |
expression itself. A Perl pattern using code interpolation to solve the |
| 5368 |
parentheses problem can be created like this: |
parentheses problem can be created like this: |
| 5369 |
|
|
| 5373 |
refers recursively to the pattern in which it appears. |
refers recursively to the pattern in which it appears. |
| 5374 |
|
|
| 5375 |
Obviously, PCRE cannot support the interpolation of Perl code. Instead, |
Obviously, PCRE cannot support the interpolation of Perl code. Instead, |
| 5376 |
it supports special syntax for recursion of the entire pattern, and |
it supports special syntax for recursion of the entire pattern, and |
| 5377 |
also for individual subpattern recursion. After its introduction in |
also for individual subpattern recursion. After its introduction in |
| 5378 |
PCRE and Python, this kind of recursion was subsequently introduced |
PCRE and Python, this kind of recursion was subsequently introduced |
| 5379 |
into Perl at release 5.10. |
into Perl at release 5.10. |
| 5380 |
|
|
| 5381 |
A special item that consists of (? followed by a number greater than |
A special item that consists of (? followed by a number greater than |
| 5382 |
zero and a closing parenthesis is a recursive subroutine call of the |
zero and a closing parenthesis is a recursive subroutine call of the |
| 5383 |
subpattern of the given number, provided that it occurs inside that |
subpattern of the given number, provided that it occurs inside that |
| 5384 |
subpattern. (If not, it is a non-recursive subroutine call, which is |
subpattern. (If not, it is a non-recursive subroutine call, which is |
| 5385 |
described in the next section.) The special item (?R) or (?0) is a |
described in the next section.) The special item (?R) or (?0) is a |
| 5386 |
recursive call of the entire regular expression. |
recursive call of the entire regular expression. |
| 5387 |
|
|
| 5388 |
This PCRE pattern solves the nested parentheses problem (assume the |
This PCRE pattern solves the nested parentheses problem (assume the |
| 5389 |
PCRE_EXTENDED option is set so that white space is ignored): |
PCRE_EXTENDED option is set so that white space is ignored): |
| 5390 |
|
|
| 5391 |
\( ( [^()]++ | (?R) )* \) |
\( ( [^()]++ | (?R) )* \) |
| 5392 |
|
|
| 5393 |
First it matches an opening parenthesis. Then it matches any number of |
First it matches an opening parenthesis. Then it matches any number of |
| 5394 |
substrings which can either be a sequence of non-parentheses, or a |
substrings which can either be a sequence of non-parentheses, or a |
| 5395 |
recursive match of the pattern itself (that is, a correctly parenthe- |
recursive match of the pattern itself (that is, a correctly parenthe- |
| 5396 |
sized substring). Finally there is a closing parenthesis. Note the use |
sized substring). Finally there is a closing parenthesis. Note the use |
| 5397 |
of a possessive quantifier to avoid backtracking into sequences of non- |
of a possessive quantifier to avoid backtracking into sequences of non- |
| 5398 |
parentheses. |
parentheses. |
| 5399 |
|
|
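What the recursive pattern does can be pictured as an ordinary recursive function. The following is a hand-written Python sketch, not PCRE code: match_parens is a hypothetical helper that follows the same steps as the pattern, with the nested call playing the role of (?R).

```python
def match_parens(subject, pos=0):
    """Return the offset just past a balanced (...) group starting at pos, or None."""
    if pos >= len(subject) or subject[pos] != "(":
        return None                          # \( must match first
    i = pos + 1
    while i < len(subject):
        ch = subject[i]
        if ch == ")":
            return i + 1                     # closing \) found
        if ch == "(":
            end = match_parens(subject, i)   # plays the role of (?R)
            if end is None:
                return None
            i = end
        else:
            i += 1                           # part of a [^()]++ run
    return None                              # subject ended before \)
```

The function never revisits a character it has already consumed, which is the effect the possessive quantifier has in the pattern.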
| 5400 |
If this were part of a larger pattern, you would not want to recurse |
If this were part of a larger pattern, you would not want to recurse |
| 5401 |
the entire pattern, so instead you could use this: |
the entire pattern, so instead you could use this: |
| 5402 |
|
|
| 5403 |
( \( ( [^()]++ | (?1) )* \) ) |
( \( ( [^()]++ | (?1) )* \) ) |
| 5404 |
|
|
| 5405 |
We have put the pattern into parentheses, and caused the recursion to |
We have put the pattern into parentheses, and caused the recursion to |
| 5406 |
refer to them instead of the whole pattern. |
refer to them instead of the whole pattern. |
| 5407 |
|
|
| 5408 |
In a larger pattern, keeping track of parenthesis numbers can be |
In a larger pattern, keeping track of parenthesis numbers can be |
| 5409 |
tricky. This is made easier by the use of relative references. Instead |
tricky. This is made easier by the use of relative references. Instead |
| 5410 |
of (?1) in the pattern above you can write (?-2) to refer to the second |
of (?1) in the pattern above you can write (?-2) to refer to the second |
| 5411 |
most recently opened parentheses preceding the recursion. In other |
most recently opened parentheses preceding the recursion. In other |
| 5412 |
words, a negative number counts capturing parentheses leftwards from |
words, a negative number counts capturing parentheses leftwards from |
| 5413 |
the point at which it is encountered. |
the point at which it is encountered. |
| 5414 |
|
|
| 5415 |
It is also possible to refer to subsequently opened parentheses, by |
It is also possible to refer to subsequently opened parentheses, by |
| 5416 |
writing references such as (?+2). However, these cannot be recursive |
writing references such as (?+2). However, these cannot be recursive |
| 5417 |
because the reference is not inside the parentheses that are refer- |
because the reference is not inside the parentheses that are refer- |
| 5418 |
enced. They are always non-recursive subroutine calls, as described in |
enced. They are always non-recursive subroutine calls, as described in |
| 5419 |
the next section. |
the next section. |
| 5420 |
|
|
| 5421 |
An alternative approach is to use named parentheses instead. The Perl |
An alternative approach is to use named parentheses instead. The Perl |
| 5422 |
syntax for this is (?&name); PCRE's earlier syntax (?P>name) is also |
syntax for this is (?&name); PCRE's earlier syntax (?P>name) is also |
| 5423 |
supported. We could rewrite the above example as follows: |
supported. We could rewrite the above example as follows: |
| 5424 |
|
|
| 5425 |
(?<pn> \( ( [^()]++ | (?&pn) )* \) ) |
(?<pn> \( ( [^()]++ | (?&pn) )* \) ) |
| 5426 |
|
|
| 5427 |
If there is more than one subpattern with the same name, the earliest |
If there is more than one subpattern with the same name, the earliest |
| 5428 |
one is used. |
one is used. |
| 5429 |
|
|
| 5430 |
This particular example pattern that we have been looking at contains |
This particular example pattern that we have been looking at contains |
| 5431 |
nested unlimited repeats, and so the use of a possessive quantifier for |
nested unlimited repeats, and so the use of a possessive quantifier for |
| 5432 |
matching strings of non-parentheses is important when applying the pat- |
matching strings of non-parentheses is important when applying the pat- |
| 5433 |
tern to strings that do not match. For example, when this pattern is |
tern to strings that do not match. For example, when this pattern is |
| 5434 |
applied to |
applied to |
| 5435 |
|
|
| 5436 |
(aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa() |
(aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa() |
| 5437 |
|
|
| 5438 |
it yields "no match" quickly. However, if a possessive quantifier is |
it yields "no match" quickly. However, if a possessive quantifier is |
| 5439 |
not used, the match runs for a very long time indeed because there are |
not used, the match runs for a very long time indeed because there are |
| 5440 |
so many different ways the + and * repeats can carve up the subject, |
so many different ways the + and * repeats can carve up the subject, |
| 5441 |
and all have to be tested before failure can be reported. |
and all have to be tested before failure can be reported. |
| 5442 |
|
|
| 5443 |
At the end of a match, the values of capturing parentheses are those |
At the end of a match, the values of capturing parentheses are those |
| 5444 |
from the outermost level. If you want to obtain intermediate values, a |
from the outermost level. If you want to obtain intermediate values, a |
| 5445 |
callout function can be used (see below and the pcrecallout documenta- |
callout function can be used (see below and the pcrecallout documenta- |
| 5446 |
tion). If the pattern above is matched against |
tion). If the pattern above is matched against |
| 5447 |
|
|
| 5448 |
(ab(cd)ef) |
(ab(cd)ef) |
| 5449 |
|
|
| 5450 |
the value for the inner capturing parentheses (numbered 2) is "ef", |
the value for the inner capturing parentheses (numbered 2) is "ef", |
| 5451 |
which is the last value taken on at the top level. If a capturing sub- |
which is the last value taken on at the top level. If a capturing sub- |
| 5452 |
pattern is not matched at the top level, its final captured value is |
pattern is not matched at the top level, its final captured value is |
| 5453 |
unset, even if it was (temporarily) set at a deeper level during the |
unset, even if it was (temporarily) set at a deeper level during the |
| 5454 |
matching process. |
matching process. |
| 5455 |
|
|
| 5456 |
If there are more than 15 capturing parentheses in a pattern, PCRE has |
If there are more than 15 capturing parentheses in a pattern, PCRE has |
| 5457 |
to obtain extra memory to store data during a recursion, which it does |
to obtain extra memory to store data during a recursion, which it does |
| 5458 |
by using pcre_malloc, freeing it via pcre_free afterwards. If no memory |
by using pcre_malloc, freeing it via pcre_free afterwards. If no memory |
| 5459 |
can be obtained, the match fails with the PCRE_ERROR_NOMEMORY error. |
can be obtained, the match fails with the PCRE_ERROR_NOMEMORY error. |
| 5460 |
|
|
| 5461 |
Do not confuse the (?R) item with the condition (R), which tests for |
Do not confuse the (?R) item with the condition (R), which tests for |
| 5462 |
recursion. Consider this pattern, which matches text in angle brack- |
recursion. Consider this pattern, which matches text in angle brack- |
| 5463 |
ets, allowing for arbitrary nesting. Only digits are allowed in nested |
ets, allowing for arbitrary nesting. Only digits are allowed in nested |
| 5464 |
brackets (that is, when recursing), whereas any characters are permit- |
brackets (that is, when recursing), whereas any characters are permit- |
| 5465 |
ted at the outer level. |
ted at the outer level. |
| 5466 |
|
|
| 5467 |
< (?: (?(R) \d++ | [^<>]*+) | (?R)) * > |
< (?: (?(R) \d++ | [^<>]*+) | (?R)) * > |
| 5468 |
|
|
| 5469 |
In this pattern, (?(R) is the start of a conditional subpattern, with |
In this pattern, (?(R) is the start of a conditional subpattern, with |
| 5470 |
two different alternatives for the recursive and non-recursive cases. |
two different alternatives for the recursive and non-recursive cases. |
| 5471 |
The (?R) item is the actual recursive call. |
The (?R) item is the actual recursive call. |
| 5472 |
|
|
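The difference between the (R) condition and the (?R) call may be easier to see in procedural form. This is a hand-written Python sketch under the assumption that a flag named recursing models the (R) test; match_angle is a hypothetical helper, not part of any PCRE API.

```python
def match_angle(subject, pos=0, recursing=False):
    """Return the offset just past a <...> group starting at pos, or None."""
    if pos >= len(subject) or subject[pos] != "<":
        return None
    i = pos + 1
    while i < len(subject):
        ch = subject[i]
        if ch == ">":
            return i + 1
        if ch == "<":
            # The call itself corresponds to (?R).
            end = match_angle(subject, i, recursing=True)
            if end is None:
                return None
            i = end
        elif recursing and ch not in "0123456789":
            return None      # (R) is true: only \d++ is allowed here
        else:
            i += 1           # (R) is false: any [^<>] character is allowed
    return None
```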
| 5473 |
Differences in recursion processing between PCRE and Perl |
Differences in recursion processing between PCRE and Perl |
| 5474 |
|
|
| 5475 |
Recursion processing in PCRE differs from Perl in two important ways. |
Recursion processing in PCRE differs from Perl in two important ways. |
| 5476 |
In PCRE (like Python, but unlike Perl), a recursive subpattern call is |
In PCRE (like Python, but unlike Perl), a recursive subpattern call is |
| 5477 |
always treated as an atomic group. That is, once it has matched some of |
always treated as an atomic group. That is, once it has matched some of |
| 5478 |
the subject string, it is never re-entered, even if it contains untried |
the subject string, it is never re-entered, even if it contains untried |
| 5479 |
alternatives and there is a subsequent matching failure. This can be |
alternatives and there is a subsequent matching failure. This can be |
| 5480 |
illustrated by the following pattern, which purports to match a palin- |
illustrated by the following pattern, which purports to match a palin- |
| 5481 |
dromic string that contains an odd number of characters (for example, |
dromic string that contains an odd number of characters (for example, |
| 5482 |
"a", "aba", "abcba", "abcdcba"): |
"a", "aba", "abcba", "abcdcba"): |
| 5483 |
|
|
| 5484 |
^(.|(.)(?1)\2)$ |
^(.|(.)(?1)\2)$ |
| 5485 |
|
|
| 5486 |
The idea is that it either matches a single character, or two identical |
The idea is that it either matches a single character, or two identical |
| 5487 |
characters surrounding a sub-palindrome. In Perl, this pattern works; |
characters surrounding a sub-palindrome. In Perl, this pattern works; |
| 5488 |
in PCRE it does not if the subject is longer than three characters. |
in PCRE it does not if the subject is longer than three characters. |
| 5489 |
Consider the subject string "abcba": |
Consider the subject string "abcba": |
| 5490 |
|
|
| 5491 |
At the top level, the first character is matched, but as it is not at |
At the top level, the first character is matched, but as it is not at |
| 5492 |
the end of the string, the first alternative fails; the second alterna- |
the end of the string, the first alternative fails; the second alterna- |
| 5493 |
tive is taken and the recursion kicks in. The recursive call to subpat- |
tive is taken and the recursion kicks in. The recursive call to subpat- |
| 5494 |
tern 1 successfully matches the next character ("b"). (Note that the |
tern 1 successfully matches the next character ("b"). (Note that the |
| 5495 |
beginning and end of line tests are not part of the recursion.) |
beginning and end of line tests are not part of the recursion.) |
| 5496 |
|
|
| 5497 |
Back at the top level, the next character ("c") is compared with what |
Back at the top level, the next character ("c") is compared with what |
| 5498 |
subpattern 2 matched, which was "a". This fails. Because the recursion |
subpattern 2 matched, which was "a". This fails. Because the recursion |
| 5499 |
is treated as an atomic group, there are now no backtracking points, |
is treated as an atomic group, there are now no backtracking points, |
| 5500 |
and so the entire match fails. (Perl is able, at this point, to re- |
and so the entire match fails. (Perl is able, at this point, to re- |
| 5501 |
enter the recursion and try the second alternative.) However, if the |
enter the recursion and try the second alternative.) However, if the |
| 5502 |
pattern is written with the alternatives in the other order, things are |
pattern is written with the alternatives in the other order, things are |
| 5503 |
different: |
different: |
| 5504 |
|
|
| 5505 |
^((.)(?1)\2|.)$ |
^((.)(?1)\2|.)$ |
| 5506 |
|
|
| 5507 |
This time, the recursing alternative is tried first, and continues to |
This time, the recursing alternative is tried first, and continues to |
| 5508 |
recurse until it runs out of characters, at which point the recursion |
recurse until it runs out of characters, at which point the recursion |
| 5509 |
fails. But this time we do have another alternative to try at the |
fails. But this time we do have another alternative to try at the |
| 5510 |
higher level. That is the big difference: in the previous case the |
higher level. That is the big difference: in the previous case the |
| 5511 |
remaining alternative is at a deeper recursion level, which PCRE cannot |
remaining alternative is at a deeper recursion level, which PCRE cannot |
| 5512 |
use. |
use. |
| 5513 |
|
|
| 5514 |
To change the pattern so that it matches all palindromic strings, not |
To change the pattern so that it matches all palindromic strings, not |
| 5515 |
just those with an odd number of characters, it is tempting to change |
just those with an odd number of characters, it is tempting to change |
| 5516 |
the pattern to this: |
the pattern to this: |
| 5517 |
|
|
| 5518 |
^((.)(?1)\2|.?)$ |
^((.)(?1)\2|.?)$ |
| 5519 |
|
|
| 5520 |
Again, this works in Perl, but not in PCRE, and for the same reason. |
Again, this works in Perl, but not in PCRE, and for the same reason. |
| 5521 |
When a deeper recursion has matched a single character, it cannot be |
When a deeper recursion has matched a single character, it cannot be |
| 5522 |
entered again in order to match an empty string. The solution is to |
entered again in order to match an empty string. The solution is to |
| 5523 |
separate the two cases, and write out the odd and even cases as alter- |
separate the two cases, and write out the odd and even cases as alter- |
| 5524 |
natives at the higher level: |
natives at the higher level: |
| 5525 |
|
|
| 5526 |
^(?:((.)(?1)\2|)|((.)(?3)\4|.)) |
^(?:((.)(?1)\2|)|((.)(?3)\4|.)) |
| 5527 |
|
|
| 5528 |
If you want to match typical palindromic phrases, the pattern has to |
If you want to match typical palindromic phrases, the pattern has to |
| 5529 |
ignore all non-word characters, which can be done like this: |
ignore all non-word characters, which can be done like this: |
| 5530 |
|
|
| 5531 |
^\W*+(?:((.)\W*+(?1)\W*+\2|)|((.)\W*+(?3)\W*+\4|\W*+.\W*+))\W*+$ |
^\W*+(?:((.)\W*+(?1)\W*+\2|)|((.)\W*+(?3)\W*+\4|\W*+.\W*+))\W*+$ |
| 5532 |
|
|
| 5533 |
If run with the PCRE_CASELESS option, this pattern matches phrases such |
If run with the PCRE_CASELESS option, this pattern matches phrases such |
| 5534 |
as "A man, a plan, a canal: Panama!" and it works well in both PCRE and |
as "A man, a plan, a canal: Panama!" and it works well in both PCRE and |
| 5535 |
Perl. Note the use of the possessive quantifier *+ to avoid backtrack- |
Perl. Note the use of the possessive quantifier *+ to avoid backtrack- |
| 5536 |
ing into sequences of non-word characters. Without this, PCRE takes a |
ing into sequences of non-word characters. Without this, PCRE takes a |
| 5537 |
great deal longer (ten times or more) to match typical phrases, and |
great deal longer (ten times or more) to match typical phrases, and |
| 5538 |
Perl takes so long that you think it has gone into a loop. |
Perl takes so long that you think it has gone into a loop. |
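
As a cross-check on what the phrase pattern above is meant to accept, the same property can be tested without a regular-expression recursion. The following is a minimal Python sketch, not PCRE code: the helper name is_palindromic_phrase is hypothetical, re.ASCII is assumed as the closest match to PCRE's default \w, and lower-casing stands in for PCRE_CASELESS.

```python
import re

def is_palindromic_phrase(text):
    """Return True if text reads the same both ways, ignoring non-word characters."""
    # Keep only word characters (the complement of \W) and fold case,
    # mirroring the \W*+ skipping and the caseless option in the pattern.
    letters = re.findall(r"\w", text.lower(), re.ASCII)
    return letters == letters[::-1]
```

Because this check does not depend on where a recursion can be re-entered, it can serve as a reference when testing subjects against the recursive pattern.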
| 5539 |
|
|
| 5540 |
WARNING: The palindrome-matching patterns above work only if the sub- |
WARNING: The palindrome-matching patterns above work only if the sub- |
| 5541 |
ject string does not start with a palindrome that is shorter than the |
ject string does not start with a palindrome that is shorter than the |
| 5542 |
entire string. For example, although "abcba" is correctly matched, if |
entire string. For example, although "abcba" is correctly matched, if |
| 5543 |
the subject is "ababa", PCRE finds the palindrome "aba" at the start, |
the subject is "ababa", PCRE finds the palindrome "aba" at the start, |
| 5544 |
then fails at top level because the end of the string does not follow. |
then fails at top level because the end of the string does not follow. |
| 5545 |
Once again, it cannot jump back into the recursion to try other alter- |
Once again, it cannot jump back into the recursion to try other alter- |
| 5546 |
natives, so the entire match fails. |
natives, so the entire match fails. |
| 5547 |
|
|
| 5548 |
The second way in which PCRE and Perl differ in their recursion pro-
The second way in which PCRE and Perl differ in their recursion pro-
| 5549 |
cessing is in the handling of captured values. In Perl, when a subpat-
cessing is in the handling of captured values. In Perl, when a subpat-
| 5550 |
tern is called recursively or as a subroutine (see the next section),
tern is called recursively or as a subroutine (see the next section),
| 5551 |
it has no access to any values that were captured outside the recur- |
it has no access to any values that were captured outside the recur- |
| 5552 |
sion, whereas in PCRE these values can be referenced. Consider this |
sion, whereas in PCRE these values can be referenced. Consider this |
| 5553 |
pattern: |
pattern: |
| 5554 |
|
|
| 5555 |
^(.)(\1|a(?2)) |
^(.)(\1|a(?2)) |
| 5556 |
|
|
| 5557 |
In PCRE, this pattern matches "bab". The first capturing parentheses |
In PCRE, this pattern matches "bab". The first capturing parentheses |
| 5558 |
match "b", then in the second group, when the back reference \1 fails |
match "b", then in the second group, when the back reference \1 fails |
| 5559 |
to match "b", the second alternative matches "a" and then recurses. In |
to match "b", the second alternative matches "a" and then recurses. In |
| 5560 |
the recursion, \1 does now match "b" and so the whole match succeeds. |
the recursion, \1 does now match "b" and so the whole match succeeds. |
| 5561 |
In Perl, the pattern fails to match because inside the recursive call |
In Perl, the pattern fails to match because inside the recursive call |
| 5562 |
\1 cannot access the externally set value. |
\1 cannot access the externally set value. |
| 5563 |
|
|
| 5564 |
|
|
| 5565 |
SUBPATTERNS AS SUBROUTINES |
SUBPATTERNS AS SUBROUTINES |
| 5566 |
|
|
| 5567 |
If the syntax for a recursive subpattern call (either by number or by |
If the syntax for a recursive subpattern call (either by number or by |
| 5568 |
name) is used outside the parentheses to which it refers, it operates |
name) is used outside the parentheses to which it refers, it operates |
| 5569 |
like a subroutine in a programming language. The called subpattern may |
like a subroutine in a programming language. The called subpattern may |
| 5570 |
be defined before or after the reference. A numbered reference can be |
be defined before or after the reference. A numbered reference can be |
| 5571 |
absolute or relative, as in these examples: |
absolute or relative, as in these examples: |
| 5572 |
|
|
| 5573 |
(...(absolute)...)...(?2)... |
(...(absolute)...)...(?2)... |
| 5578 |
|
|
| 5579 |
(sens|respons)e and \1ibility |
(sens|respons)e and \1ibility |
| 5580 |
|
|
| 5581 |
matches "sense and sensibility" and "response and responsibility", but |
matches "sense and sensibility" and "response and responsibility", but |
| 5582 |
not "sense and responsibility". If instead the pattern |
not "sense and responsibility". If instead the pattern |
| 5583 |
|
|
| 5584 |
(sens|respons)e and (?1)ibility |
(sens|respons)e and (?1)ibility |
| 5585 |
|
|
| 5586 |
is used, it does match "sense and responsibility" as well as the other |
is used, it does match "sense and responsibility" as well as the other |
| 5587 |
two strings. Another example is given in the discussion of DEFINE |
two strings. Another example is given in the discussion of DEFINE |
| 5588 |
above. |
above. |
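
The difference between the back reference and the subroutine call can be sketched in Python. Python's re module has no (?1) syntax, so this sketch approximates the call by re-inserting the source text of subpattern 1; the names GROUP, backref, subroutine and matches are illustrative only.

```python
import re

GROUP = r"sens|respons"  # source text of subpattern 1

# Back reference: \1 must repeat the text that group 1 actually matched.
backref = re.compile(rf"({GROUP})e and \1ibility")

# Subroutine-call analogue: the subpattern itself is applied again, so
# either alternative may match the second time. PCRE writes this as (?1).
subroutine = re.compile(rf"({GROUP})e and (?:{GROUP})ibility")

def matches(regex, subject):
    """Return True if regex matches the whole of subject."""
    return regex.fullmatch(subject) is not None
```

Textual re-insertion does not reproduce the atomic-group treatment of subroutine calls described in this section, nor the reverting of captured values afterwards.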
| 5589 |
|
|
| 5590 |
All subroutine calls, whether recursive or not, are always treated as |
All subroutine calls, whether recursive or not, are always treated as |
| 5591 |
atomic groups. That is, once a subroutine has matched some of the sub- |
atomic groups. That is, once a subroutine has matched some of the sub- |
| 5592 |
ject string, it is never re-entered, even if it contains untried alter- |
ject string, it is never re-entered, even if it contains untried alter- |
| 5593 |
natives and there is a subsequent matching failure. Any capturing |
natives and there is a subsequent matching failure. Any capturing |
| 5594 |
parentheses that are set during the subroutine call revert to their |
parentheses that are set during the subroutine call revert to their |
| 5595 |
previous values afterwards. |
previous values afterwards. |
| 5596 |
|
|
| 5597 |
Processing options such as case-independence are fixed when a subpat- |
Processing options such as case-independence are fixed when a subpat- |
| 5598 |
tern is defined, so if it is used as a subroutine, such options cannot |
tern is defined, so if it is used as a subroutine, such options cannot |
| 5599 |
be changed for different calls. For example, consider this pattern: |
be changed for different calls. For example, consider this pattern: |
| 5600 |
|
|
| 5601 |
(abc)(?i:(?-1)) |
(abc)(?i:(?-1)) |
| 5602 |
|
|
| 5603 |
It matches "abcabc". It does not match "abcABC" because the change of |
It matches "abcabc". It does not match "abcABC" because the change of |
| 5604 |
processing option does not affect the called subpattern. |
processing option does not affect the called subpattern. |
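
The effect of fixing options at definition time can be approximated in Python, which supports scoped inline flags but not subroutine calls. In this sketch (an analogue, with illustrative names SUB, inlined and pinned), the called subpattern is inlined as text, and the "pinned" variant explicitly restores the options that were in force where the subpattern was defined.

```python
import re

SUB = "abc"  # source text of the subpattern being "called"

# Naive textual inlining: the enclosing (?i:...) leaks into the subpattern,
# which is NOT how the text describes (?-1) behaving.
inlined = re.compile(rf"({SUB})(?i:{SUB})")

# Pinning the definition-time options with (?-i:...) is closer to the
# behaviour described above: the caller's option change has no effect.
pinned = re.compile(rf"({SUB})(?i:(?-i:{SUB}))")
```

With these definitions, pinned accepts "abcabc" and rejects "abcABC", as the text states for (abc)(?i:(?-1)), whereas inlined accepts both.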
| 5605 |
|
|
| 5606 |
|
|
| 5607 |
ONIGURUMA SUBROUTINE SYNTAX |
ONIGURUMA SUBROUTINE SYNTAX |
| 5608 |
|
|
| 5609 |
For compatibility with Oniguruma, the non-Perl syntax \g followed by a |
For compatibility with Oniguruma, the non-Perl syntax \g followed by a |
| 5610 |
name or a number enclosed either in angle brackets or single quotes, is |
name or a number enclosed either in angle brackets or single quotes, is |
| 5611 |
an alternative syntax for referencing a subpattern as a subroutine, |
an alternative syntax for referencing a subpattern as a subroutine, |
| 5612 |
possibly recursively. Here are two of the examples used above, rewrit- |
possibly recursively. Here are two of the examples used above, rewrit- |
| 5613 |
ten using this syntax: |
ten using this syntax: |
| 5614 |
|
|
| 5615 |
(?<pn> \( ( (?>[^()]+) | \g<pn> )* \) ) |
(?<pn> \( ( (?>[^()]+) | \g<pn> )* \) ) |
| 5616 |
(sens|respons)e and \g'1'ibility |
(sens|respons)e and \g'1'ibility |
| 5617 |
|
|
| 5618 |
PCRE supports an extension to Oniguruma: if a number is preceded by a |
PCRE supports an extension to Oniguruma: if a number is preceded by a |
| 5619 |
plus or a minus sign it is taken as a relative reference. For example: |
plus or a minus sign it is taken as a relative reference. For example: |
| 5620 |
|
|
| 5621 |
(abc)(?i:\g<-1>) |
(abc)(?i:\g<-1>) |
| 5622 |
|
|
| 5623 |
Note that \g{...} (Perl syntax) and \g<...> (Oniguruma syntax) are not |
Note that \g{...} (Perl syntax) and \g<...> (Oniguruma syntax) are not |
| 5624 |
synonymous. The former is a back reference; the latter is a subroutine |
synonymous. The former is a back reference; the latter is a subroutine |
| 5625 |
call. |
call. |
| 5626 |
|
|
| 5627 |
|
|
| 5628 |
CALLOUTS |
CALLOUTS |
| 5629 |
|
|
| 5630 |
Perl has a feature whereby using the sequence (?{...}) causes arbitrary |
Perl has a feature whereby using the sequence (?{...}) causes arbitrary |
| 5631 |
Perl code to be obeyed in the middle of matching a regular expression. |
Perl code to be obeyed in the middle of matching a regular expression. |
| 5632 |
This makes it possible, amongst other things, to extract different sub- |
This makes it possible, amongst other things, to extract different sub- |
| 5633 |
strings that match the same pair of parentheses when there is a repeti- |
strings that match the same pair of parentheses when there is a repeti- |
| 5634 |
tion. |
tion. |
| 5635 |
|
|
| 5636 |
PCRE provides a similar feature, but of course it cannot obey arbitrary |
PCRE provides a similar feature, but of course it cannot obey arbitrary |
| 5637 |
Perl code. The feature is called "callout". The caller of PCRE provides |
Perl code. The feature is called "callout". The caller of PCRE provides |
| 5638 |
an external function by putting its entry point in the global variable |
an external function by putting its entry point in the global variable |
| 5639 |
pcre_callout. By default, this variable contains NULL, which disables |
pcre_callout. By default, this variable contains NULL, which disables |
| 5640 |
all calling out. |
all calling out. |
| 5641 |
|
|
| 5642 |
Within a regular expression, (?C) indicates the points at which the |
Within a regular expression, (?C) indicates the points at which the |
| 5643 |
external function is to be called. If you want to identify different |
external function is to be called. If you want to identify different |
| 5644 |
callout points, you can put a number less than 256 after the letter C. |
callout points, you can put a number less than 256 after the letter C. |
| 5645 |
The default value is zero. For example, this pattern has two callout |
The default value is zero. For example, this pattern has two callout |
| 5646 |
points: |
points: |
| 5647 |
|
|
| 5648 |
(?C1)abc(?C2)def |
(?C1)abc(?C2)def |
| 5649 |
|
|
| 5650 |
If the PCRE_AUTO_CALLOUT flag is passed to pcre_compile(), callouts are |
If the PCRE_AUTO_CALLOUT flag is passed to pcre_compile(), callouts are |
| 5651 |
automatically installed before each item in the pattern. They are all |
automatically installed before each item in the pattern. They are all |
| 5652 |
numbered 255. |
numbered 255. |
| 5653 |
|
|
| 5654 |
During matching, when PCRE reaches a callout point (and pcre_callout is |
During matching, when PCRE reaches a callout point (and pcre_callout is |
| 5655 |
set), the external function is called. It is provided with the number |
set), the external function is called. It is provided with the number |
| 5656 |
of the callout, the position in the pattern, and, optionally, one item |
of the callout, the position in the pattern, and, optionally, one item |
| 5657 |
of data originally supplied by the caller of pcre_exec(). The callout |
of data originally supplied by the caller of pcre_exec(). The callout |
| 5658 |
function may cause matching to proceed, to backtrack, or to fail alto- |
function may cause matching to proceed, to backtrack, or to fail alto- |
| 5659 |
gether. A complete description of the interface to the callout function |
gether. A complete description of the interface to the callout function |
| 5660 |
is given in the pcrecallout documentation. |
is given in the pcrecallout documentation. |
| 5661 |
|
|
| 5662 |
|
|
| 5663 |
BACKTRACKING CONTROL |
BACKTRACKING CONTROL |
| 5664 |
|
|
| 5665 |
Perl 5.10 introduced a number of "Special Backtracking Control Verbs", |
Perl 5.10 introduced a number of "Special Backtracking Control Verbs", |
| 5666 |
which are described in the Perl documentation as "experimental and sub- |
which are described in the Perl documentation as "experimental and sub- |
| 5667 |
ject to change or removal in a future version of Perl". It goes on to |
ject to change or removal in a future version of Perl". It goes on to |
| 5668 |
say: "Their usage in production code should be noted to avoid problems |
say: "Their usage in production code should be noted to avoid problems |
| 5669 |
during upgrades." The same remarks apply to the PCRE features described |
during upgrades." The same remarks apply to the PCRE features described |
| 5670 |
in this section. |
in this section. |
| 5671 |
|
|
| 5672 |
Since these verbs are specifically related to backtracking, most of |
Since these verbs are specifically related to backtracking, most of |
| 5673 |
them can be used only when the pattern is to be matched using |
them can be used only when the pattern is to be matched using |
| 5674 |
pcre_exec(), which uses a backtracking algorithm. With the exception of |
pcre_exec(), which uses a backtracking algorithm. With the exception of |
| 5675 |
(*FAIL), which behaves like a failing negative assertion, they cause an |
(*FAIL), which behaves like a failing negative assertion, they cause an |
| 5676 |
error if encountered by pcre_dfa_exec(). |
error if encountered by pcre_dfa_exec(). |
| 5677 |
|
|
| 5678 |
If any of these verbs are used in an assertion or in a subpattern that |
If any of these verbs are used in an assertion or in a subpattern that |
| 5679 |
is called as a subroutine (whether or not recursively), their effect is |
is called as a subroutine (whether or not recursively), their effect is |
| 5680 |
confined to that subpattern; it does not extend to the surrounding pat-
confined to that subpattern; it does not extend to the surrounding pat-
| 5681 |
tern, with one exception: a *MARK that is encountered in a positive
tern, with one exception: the name from a (*MARK), (*PRUNE), or (*THEN)
| 5682 |
assertion is passed back (compare capturing parentheses in assertions). |
that is encountered in a successful positive assertion is passed back |
| 5683 |
|
when a match succeeds (compare capturing parentheses in assertions). |
| 5684 |
Note that such subpatterns are processed as anchored at the point where |
Note that such subpatterns are processed as anchored at the point where |
| 5685 |
they are tested. Note also that Perl's treatment of subroutines is dif- |
they are tested. Note also that Perl's treatment of subroutines is dif- |
| 5686 |
ferent in some cases. |
ferent in some cases. |
| 5703 |
by setting the PCRE_NO_START_OPTIMIZE option when calling pcre_com- |
by setting the PCRE_NO_START_OPTIMIZE option when calling pcre_com- |
| 5704 |
pile() or pcre_exec(), or by starting the pattern with (*NO_START_OPT). |
pile() or pcre_exec(), or by starting the pattern with (*NO_START_OPT). |
| 5705 |
|
|
| 5706 |
|
Experiments with Perl suggest that it too has similar optimizations, |
| 5707 |
|
sometimes leading to anomalous results. |
| 5708 |
|
|
| 5709 |
Verbs that act immediately |
Verbs that act immediately |
| 5710 |
|
|
| 5711 |
The following verbs act as soon as they are encountered. They may not |
The following verbs act as soon as they are encountered. They may not |
| 5712 |
be followed by a name. |
be followed by a name. |
| 5713 |
|
|
| 5714 |
(*ACCEPT) |
(*ACCEPT) |
| 5715 |
|
|
| 5716 |
This verb causes the match to end successfully, skipping the remainder |
This verb causes the match to end successfully, skipping the remainder |
| 5717 |
of the pattern. However, when it is inside a subpattern that is called |
of the pattern. However, when it is inside a subpattern that is called |
| 5718 |
as a subroutine, only that subpattern is ended successfully. Matching |
as a subroutine, only that subpattern is ended successfully. Matching |
| 5719 |
then continues at the outer level. If (*ACCEPT) is inside capturing |
then continues at the outer level. If (*ACCEPT) is inside capturing |
| 5720 |
parentheses, the data so far is captured. For example: |
parentheses, the data so far is captured. For example: |
| 5721 |
|
|
| 5722 |
A((?:A|B(*ACCEPT)|C)D) |
A((?:A|B(*ACCEPT)|C)D) |
| 5723 |
|
|
| 5724 |
This matches "AB", "AAD", or "ACD"; when it matches "AB", "B" is cap- |
This matches "AB", "AAD", or "ACD"; when it matches "AB", "B" is cap- |
| 5725 |
tured by the outer parentheses. |
tured by the outer parentheses. |
| 5726 |
|
|
| 5727 |
(*FAIL) or (*F) |
(*FAIL) or (*F) |
| 5728 |
|
|
| 5729 |
This verb causes a matching failure, forcing backtracking to occur. It |
This verb causes a matching failure, forcing backtracking to occur. It |
| 5730 |
is equivalent to (?!) but easier to read. The Perl documentation notes |
is equivalent to (?!) but easier to read. The Perl documentation notes |
| 5731 |
that it is probably useful only when combined with (?{}) or (??{}). |
that it is probably useful only when combined with (?{}) or (??{}). |
| 5732 |
Those are, of course, Perl features that are not present in PCRE. The |
Those are, of course, Perl features that are not present in PCRE. The |
| 5733 |
nearest equivalent is the callout feature, as for example in this pat- |
nearest equivalent is the callout feature, as for example in this pat- |
| 5734 |
tern: |
tern: |
| 5735 |
|
|
| 5736 |
a+(?C)(*FAIL) |
a+(?C)(*FAIL) |
| 5737 |
|
|
| 5738 |
A match with the string "aaaa" always fails, but the callout is taken |
A match with the string "aaaa" always fails, but the callout is taken |
| 5739 |
before each backtrack happens (in this example, 10 times). |
before each backtrack happens (in this example, 10 times). |
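
The figure of 10 callouts can be checked by counting. This Python sketch assumes one match attempt per starting position in the subject and no start-of-match optimization that skips attempts; count_callouts is a hypothetical helper, not part of PCRE.

```python
def count_callouts(subject, char="a"):
    """Count callouts taken by a pattern like a+(?C)(*FAIL) against subject."""
    total = 0
    for start in range(len(subject)):
        # Length of the run of `char` beginning at this starting position.
        run = 0
        while start + run < len(subject) and subject[start + run] == char:
            run += 1
        # a+ is tried with lengths run, run-1, ..., 1, and each try
        # reaches the callout once before (*FAIL) forces a backtrack.
        total += run
    return total
```

For the subject "aaaa" the runs at the four starting positions have lengths 4, 3, 2 and 1, which sum to 10.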
| 5740 |
|
|
| 5741 |
Recording which path was taken |
Recording which path was taken |
| 5742 |
|
|
| 5743 |
There is one verb whose main purpose is to track how a match was |
There is one verb whose main purpose is to track how a match was |
| 5744 |
arrived at, though it also has a secondary use in conjunction with |
arrived at, though it also has a secondary use in conjunction with |
| 5745 |
advancing the match starting point (see (*SKIP) below). |
advancing the match starting point (see (*SKIP) below). |
| 5746 |
|
|
| 5747 |
(*MARK:NAME) or (*:NAME) |
(*MARK:NAME) or (*:NAME) |
| 5748 |
|
|
| 5749 |
A name is always required with this verb. There may be as many |
A name is always required with this verb. There may be as many |
| 5750 |
instances of (*MARK) as you like in a pattern, and their names do not |
instances of (*MARK) as you like in a pattern, and their names do not |
| 5751 |
have to be unique. |
have to be unique. |
| 5752 |
|
|
| 5753 |
When a match succeeds, the name of the last-encountered (*MARK) is |
When a match succeeds, the name of the last-encountered (*MARK) on the |
| 5754 |
passed back to the caller via the pcre_extra data structure, as |
matching path is passed back to the caller via the pcre_extra data |
| 5755 |
described in the section on pcre_extra in the pcreapi documentation. No |
structure, as described in the section on pcre_extra in the pcreapi |
| 5756 |
data is returned for a partial match. Here is an example of pcretest |
documentation. Here is an example of pcretest output, where the /K mod- |
| 5757 |
output, where the /K modifier requests the retrieval and outputting of |
ifier requests the retrieval and outputting of (*MARK) data: |
|
(*MARK) data: |
|
| 5758 |
|
|
| 5759 |
/X(*MARK:A)Y|X(*MARK:B)Z/K |
re> /X(*MARK:A)Y|X(*MARK:B)Z/K |
| 5760 |
XY |
data> XY |
| 5761 |
0: XY |
0: XY |
| 5762 |
MK: A |
MK: A |
| 5763 |
XZ |
XZ |
| 5773 |
and passed back if it is the last-encountered. This does not happen for |
and passed back if it is the last-encountered. This does not happen for |
| 5774 |
negative assertions. |
negative assertions. |
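
Python's re module has no (*MARK), but the idea of recording which path a successful match took has a rough analogue there. In this sketch, empty named groups stand in for (*MARK:A) and (*MARK:B), and the match object's lastgroup attribute plays the role of the returned mark. PATTERN and last_mark are illustrative names.

```python
import re

# Empty named groups stand in for (*MARK:A) and (*MARK:B).
PATTERN = re.compile(r"X(?P<A>)Y|X(?P<B>)Z")

def last_mark(subject):
    """Return the name of the last group on the matching path, or None."""
    m = PATTERN.search(subject)
    # lastgroup is the name of the last capturing group that took part
    # in the successful match.
    return m.lastgroup if m else None
```

This covers only the successful-match case. Unlike PCRE, Python reports nothing after a failed match, so the behaviour described for failed and partial matches has no counterpart in this sketch.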
| 5775 |
|
|
| 5776 |
A name may also be returned after a failed match if the final path |
After a partial match or a failed match, the name of the last encoun- |
| 5777 |
through the pattern involves (*MARK). However, unless (*MARK) is used in
tered (*MARK) in the entire match process is returned. For example: |
|
conjunction with (*COMMIT), this is unlikely to happen for an unan- |
|
|
chored pattern because, as the starting point for matching is advanced, |
|
|
the final check is often with an empty string, causing a failure before |
|
|
(*MARK) is reached. For example: |
|
|
|
|
|
/X(*MARK:A)Y|X(*MARK:B)Z/K |
|
|
XP |
|
|
No match |
|
|
|
|
|
There are three potential starting points for this match (starting with |
|
|
X, starting with P, and with an empty string). If the pattern is |
|
|
anchored, the result is different: |
|
| 5778 |
|
|
| 5779 |
/^X(*MARK:A)Y|^X(*MARK:B)Z/K |
re> /X(*MARK:A)Y|X(*MARK:B)Z/K |
| 5780 |
XP |
data> XP |
| 5781 |
No match, mark = B |
No match, mark = B |
| 5782 |
|
|
| 5783 |
PCRE's start-of-match optimizations can also interfere with this. For |
Note that in this unanchored example the mark is retained from the |
| 5784 |
example, if, as a result of a call to pcre_study(), it knows the mini- |
match attempt that started at the letter "X". Subsequent match attempts |
| 5785 |
mum subject length for a match, a shorter subject will not be scanned |
starting at "P" and then with an empty string do not get as far as the |
| 5786 |
at all. |
(*MARK) item, but nevertheless do not reset it. |
|
|
|
|
Note that similar anomalies (though different in detail) exist in Perl, |
|
|
no doubt for the same reasons. The use of (*MARK) data after a failed |
|
|
match of an unanchored pattern is not recommended, unless (*COMMIT) is |
|
|
involved. |
|
| 5787 |
|
|
| 5788 |
Verbs that act after backtracking |
Verbs that act after backtracking |
| 5789 |
|
|
| 5790 |
The following verbs do nothing when they are encountered. Matching con- |
The following verbs do nothing when they are encountered. Matching con- |
| 5791 |
tinues with what follows, but if there is no subsequent match, causing |
tinues with what follows, but if there is no subsequent match, causing |
| 5792 |
a backtrack to the verb, a failure is forced. That is, backtracking |
a backtrack to the verb, a failure is forced. That is, backtracking |
| 5793 |
cannot pass to the left of the verb. However, when one of these verbs |
cannot pass to the left of the verb. However, when one of these verbs |
| 5794 |
appears inside an atomic group, its effect is confined to that group, |
appears inside an atomic group, its effect is confined to that group, |
| 5795 |
because once the group has been matched, there is never any backtrack- |
because once the group has been matched, there is never any backtrack- |
| 5796 |
ing into it. In this situation, backtracking can "jump back" to the |
ing into it. In this situation, backtracking can "jump back" to the |
| 5797 |
left of the entire atomic group. (Remember also, as stated above, that |
left of the entire atomic group. (Remember also, as stated above, that |
| 5798 |
this localization also applies in subroutine calls and assertions.) |
this localization also applies in subroutine calls and assertions.) |
| 5799 |
|
|
| 5800 |
These verbs differ in exactly what kind of failure occurs when back- |
These verbs differ in exactly what kind of failure occurs when back- |
| 5801 |
tracking reaches them. |
tracking reaches them. |
| 5802 |
|
|
| 5803 |
(*COMMIT) |
(*COMMIT) |
| 5804 |
|
|
| 5805 |
This verb, which may not be followed by a name, causes the whole match |
This verb, which may not be followed by a name, causes the whole match |
| 5806 |
to fail outright if the rest of the pattern does not match. Even if the |
to fail outright if the rest of the pattern does not match. Even if the |
| 5807 |
pattern is unanchored, no further attempts to find a match by advancing |
pattern is unanchored, no further attempts to find a match by advancing |
| 5808 |
the starting point take place. Once (*COMMIT) has been passed, |
the starting point take place. Once (*COMMIT) has been passed, |
| 5809 |
pcre_exec() is committed to finding a match at the current starting |
pcre_exec() is committed to finding a match at the current starting |
| 5810 |
point, or not at all. For example: |
point, or not at all. For example: |
| 5811 |
|
|
| 5812 |
a+(*COMMIT)b |
a+(*COMMIT)b |
| 5813 |
|
|
| 5814 |
This matches "xxaab" but not "aacaab". It can be thought of as a kind |
This matches "xxaab" but not "aacaab". It can be thought of as a kind |
| 5815 |
of dynamic anchor, or "I've started, so I must finish." The name of the |
of dynamic anchor, or "I've started, so I must finish." The name of the |
| 5816 |
most recently passed (*MARK) in the path is passed back when (*COMMIT) |
most recently passed (*MARK) in the path is passed back when (*COMMIT) |
| 5817 |
forces a match failure. |
forces a match failure. |
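
The effect of (*COMMIT) in a+(*COMMIT)b can be emulated by hand. The Python sketch below is an illustration of the semantics described above, not PCRE itself: commit_match is a hypothetical helper, and start-of-match optimizations are ignored.

```python
import re

def commit_match(subject):
    """Emulate a+(*COMMIT)b: return the matched text, or None on failure."""
    # Find the first starting position at which a+ matches (greedily).
    run = re.search(r"a+", subject)
    if run is None:
        return None  # (*COMMIT) is never reached; ordinary failure
    end = run.end()
    # Once (*COMMIT) has been passed, the match must succeed from this
    # starting point or fail outright; no later starting point is tried.
    if subject.startswith("b", end):
        return subject[run.start():end + 1]
    return None
```

For "xxaab" this returns "aab"; for "aacaab" it returns None, whereas the plain pattern a+b would go on to find "aab" at a later starting point.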
| 5818 |
|
|
| 5819 |
Note that (*COMMIT) at the start of a pattern is not the same as an |
Note that (*COMMIT) at the start of a pattern is not the same as an |
| 5820 |
anchor, unless PCRE's start-of-match optimizations are turned off, as |
anchor, unless PCRE's start-of-match optimizations are turned off, as |
| 5821 |
shown in this pcretest example: |
shown in this pcretest example: |
| 5822 |
|
|
| 5823 |
/(*COMMIT)abc/ |
re> /(*COMMIT)abc/ |
| 5824 |
xyzabc |
data> xyzabc |
| 5825 |
0: abc |
0: abc |
| 5826 |
xyzabc\Y |
xyzabc\Y |
| 5827 |
No match |
No match |
| 5828 |
|
|
| 5829 |
PCRE knows that any match must start with "a", so the optimization |
PCRE knows that any match must start with "a", so the optimization |
| 5830 |
skips along the subject to "a" before running the first match attempt, |
skips along the subject to "a" before running the first match attempt, |
| 5831 |
which succeeds. When the optimization is disabled by the \Y escape in |
which succeeds. When the optimization is disabled by the \Y escape in |
| 5832 |
the second subject, the match starts at "x" and so the (*COMMIT) causes |
the second subject, the match starts at "x" and so the (*COMMIT) causes |
| 5833 |
it to fail without trying any other starting points. |
it to fail without trying any other starting points. |
| 5834 |
|
|
| 5835 |
(*PRUNE) or (*PRUNE:NAME) |
(*PRUNE) or (*PRUNE:NAME) |
| 5836 |
|
|
| 5837 |
This verb causes the match to fail at the current starting position in |
This verb causes the match to fail at the current starting position in |
| 5838 |
the subject if the rest of the pattern does not match. If the pattern |
the subject if the rest of the pattern does not match. If the pattern |
| 5839 |
is unanchored, the normal "bumpalong" advance to the next starting |
is unanchored, the normal "bumpalong" advance to the next starting |
| 5840 |
character then happens. Backtracking can occur as usual to the left of |
character then happens. Backtracking can occur as usual to the left of |
| 5841 |
(*PRUNE), before it is reached, or when matching to the right of |
(*PRUNE), before it is reached, or when matching to the right of |
| 5842 |
(*PRUNE), but if there is no match to the right, backtracking cannot |
(*PRUNE), but if there is no match to the right, backtracking cannot |
| 5843 |
cross (*PRUNE). In simple cases, the use of (*PRUNE) is just an alter- |
cross (*PRUNE). In simple cases, the use of (*PRUNE) is just an alter- |
| 5844 |
native to an atomic group or possessive quantifier, but there are some |
native to an atomic group or possessive quantifier, but there are some |
| 5845 |
uses of (*PRUNE) that cannot be expressed in any other way. The behav- |
uses of (*PRUNE) that cannot be expressed in any other way. The behav- |
| 5846 |
iour of (*PRUNE:NAME) is the same as (*MARK:NAME)(*PRUNE) when the |
iour of (*PRUNE:NAME) is the same as (*MARK:NAME)(*PRUNE). In an |
| 5847 |
match fails completely; the name is passed back if this is the final |
anchored pattern (*PRUNE) has the same effect as (*COMMIT). |
|
attempt. (*PRUNE:NAME) does not pass back a name if the match suc- |
|
|
ceeds. In an anchored pattern (*PRUNE) has the same effect as (*COM- |
|
|
MIT). |
|
| 5848 |
|
|
| 5849 |
(*SKIP) |
(*SKIP) |
| 5850 |
|
|
| 5871 |
is searched for the most recent (*MARK) that has the same name. If one |
is searched for the most recent (*MARK) that has the same name. If one |
| 5872 |
is found, the "bumpalong" advance is to the subject position that cor- |
is found, the "bumpalong" advance is to the subject position that cor- |
| 5873 |
responds to that (*MARK) instead of to where (*SKIP) was encountered. |
responds to that (*MARK) instead of to where (*SKIP) was encountered. |
| 5874 |
If no (*MARK) with a matching name is found, normal "bumpalong" of one |
If no (*MARK) with a matching name is found, the (*SKIP) is ignored. |
|
character happens (that is, the (*SKIP) is ignored). |
|
| 5875 |
|
|
| 5876 |
(*THEN) or (*THEN:NAME) |
(*THEN) or (*THEN:NAME) |
| 5877 |
|
|
| 5878 |
This verb causes a skip to the next innermost alternative if the rest |
This verb causes a skip to the next innermost alternative if the rest |
| 5879 |
of the pattern does not match. That is, it cancels pending backtrack- |
of the pattern does not match. That is, it cancels pending backtrack- |
| 5880 |
ing, but only within the current alternative. Its name comes from the |
ing, but only within the current alternative. Its name comes from the |
| 5881 |
observation that it can be used for a pattern-based if-then-else block: |
observation that it can be used for a pattern-based if-then-else block: |
| 5882 |
|
|
| 5883 |
( COND1 (*THEN) FOO | COND2 (*THEN) BAR | COND3 (*THEN) BAZ ) ... |
( COND1 (*THEN) FOO | COND2 (*THEN) BAR | COND3 (*THEN) BAZ ) ... |
| 5884 |
|
|
| 5885 |
If the COND1 pattern matches, FOO is tried (and possibly further items |
If the COND1 pattern matches, FOO is tried (and possibly further items |
| 5886 |
after the end of the group if FOO succeeds); on failure, the matcher |
after the end of the group if FOO succeeds); on failure, the matcher |
| 5887 |
skips to the second alternative and tries COND2, without backtracking |
skips to the second alternative and tries COND2, without backtracking |
| 5888 |
into COND1. The behaviour of (*THEN:NAME) is exactly the same as |
into COND1. The behaviour of (*THEN:NAME) is exactly the same as |
| 5889 |
(*MARK:NAME)(*THEN) if the overall match fails. If (*THEN) is not |
(*MARK:NAME)(*THEN). If (*THEN) is not inside an alternation, it acts |
| 5890 |
inside an alternation, it acts like (*PRUNE). |
like (*PRUNE). |
| 5891 |
|
|
| 5892 |
Note that a subpattern that does not contain a | character is just a |
Note that a subpattern that does not contain a | character is just a |
| 5893 |
part of the enclosing alternative; it is not a nested alternation with |
part of the enclosing alternative; it is not a nested alternation with |
| 5894 |
only one alternative. The effect of (*THEN) extends beyond such a sub- |
only one alternative. The effect of (*THEN) extends beyond such a sub- |
| 5895 |
pattern to the enclosing alternative. Consider this pattern, where A, |
pattern to the enclosing alternative. Consider this pattern, where A, |
| 5896 |
B, etc. are complex pattern fragments that do not contain any | charac- |
B, etc. are complex pattern fragments that do not contain any | charac- |
| 5897 |
ters at this level: |
ters at this level: |
| 5898 |
|
|
| 5899 |
A (B(*THEN)C) | D |
A (B(*THEN)C) | D |
| 5900 |
|
|
| 5901 |
If A and B are matched, but there is a failure in C, matching does not |
If A and B are matched, but there is a failure in C, matching does not |
| 5902 |
backtrack into A; instead it moves to the next alternative, that is, D. |
backtrack into A; instead it moves to the next alternative, that is, D. |
| 5903 |
However, if the subpattern containing (*THEN) is given an alternative, |
However, if the subpattern containing (*THEN) is given an alternative, |
| 5904 |
it behaves differently: |
it behaves differently: |
| 5905 |
|
|
| 5906 |
A (B(*THEN)C | (*FAIL)) | D |
A (B(*THEN)C | (*FAIL)) | D |
| 5907 |
|
|
| 5908 |
The effect of (*THEN) is now confined to the inner subpattern. After a |
The effect of (*THEN) is now confined to the inner subpattern. After a |
| 5909 |
failure in C, matching moves to (*FAIL), which causes the whole subpat- |
failure in C, matching moves to (*FAIL), which causes the whole subpat- |
| 5910 |
tern to fail because there are no more alternatives to try. In this |
tern to fail because there are no more alternatives to try. In this |
| 5911 |
case, matching does now backtrack into A. |
case, matching does now backtrack into A. |
| 5912 |
|
|
| 5913 |
Note also that a conditional subpattern is not considered as having two |
Note also that a conditional subpattern is not considered as having two |
| 5914 |
alternatives, because only one is ever used. In other words, the | |
alternatives, because only one is ever used. In other words, the | |
| 5915 |
character in a conditional subpattern has a different meaning. Ignoring |
character in a conditional subpattern has a different meaning. Ignoring |
| 5916 |
white space, consider: |
white space, consider: |
| 5917 |
|
|
| 5918 |
^.*? (?(?=a) a | b(*THEN)c ) |
^.*? (?(?=a) a | b(*THEN)c ) |
| 5919 |
|
|
| 5920 |
If the subject is "ba", this pattern does not match. Because .*? is |
If the subject is "ba", this pattern does not match. Because .*? is |
| 5921 |
ungreedy, it initially matches zero characters. The condition (?=a) |
ungreedy, it initially matches zero characters. The condition (?=a) |
| 5922 |
then fails, the character "b" is matched, but "c" is not. At this |
then fails, the character "b" is matched, but "c" is not. At this |
| 5923 |
point, matching does not backtrack to .*? as might perhaps be expected |
point, matching does not backtrack to .*? as might perhaps be expected |
| 5924 |
from the presence of the | character. The conditional subpattern is |
from the presence of the | character. The conditional subpattern is |
| 5925 |
part of the single alternative that comprises the whole pattern, and so |
part of the single alternative that comprises the whole pattern, and so |
| 5926 |
the match fails. (If there was a backtrack into .*?, allowing it to |
the match fails. (If there was a backtrack into .*?, allowing it to |
| 5927 |
match "b", the match would succeed.) |
match "b", the match would succeed.) |
| 5928 |
|
|
| 5929 |
The verbs just described provide four different "strengths" of control |
The verbs just described provide four different "strengths" of control |
| 5930 |
when subsequent matching fails. (*THEN) is the weakest, carrying on the |
when subsequent matching fails. (*THEN) is the weakest, carrying on the |
| 5931 |
match at the next alternative. (*PRUNE) comes next, failing the match |
match at the next alternative. (*PRUNE) comes next, failing the match |
| 5932 |
at the current starting position, but allowing an advance to the next |
at the current starting position, but allowing an advance to the next |
| 5933 |
character (for an unanchored pattern). (*SKIP) is similar, except that |
character (for an unanchored pattern). (*SKIP) is similar, except that |
| 5934 |
the advance may be more than one character. (*COMMIT) is the strongest, |
the advance may be more than one character. (*COMMIT) is the strongest, |
| 5935 |
causing the entire match to fail. |
causing the entire match to fail. |
| 5936 |
|
|
| 5940 |
|
|
| 5941 |
(A(*COMMIT)B(*THEN)C|D) |
(A(*COMMIT)B(*THEN)C|D) |
| 5942 |
|
|
| 5943 |
Once A has matched, PCRE is committed to this match, at the current |
Once A has matched, PCRE is committed to this match, at the current |
| 5944 |
starting position. If subsequently B matches, but C does not, the nor- |
starting position. If subsequently B matches, but C does not, the nor- |
| 5945 |
mal (*THEN) action of trying the next alternative (that is, D) does not |
mal (*THEN) action of trying the next alternative (that is, D) does not |
| 5946 |
happen because (*COMMIT) overrides. |
happen because (*COMMIT) overrides. |
| 5947 |
|
|
| 5960 |
|
|
| 5961 |
REVISION |
REVISION |
| 5962 |
|
|
| 5963 |
Last updated: 19 October 2011 |
Last updated: 29 November 2011 |
| 5964 |
Copyright (c) 1997-2011 University of Cambridge. |
Copyright (c) 1997-2011 University of Cambridge. |
| 5965 |
------------------------------------------------------------------------------ |
------------------------------------------------------------------------------ |
| 5966 |
|
|
| 6529 |
been fully tested. If --enable-jit is set on an unsupported platform, |
been fully tested. If --enable-jit is set on an unsupported platform, |
| 6530 |
compilation fails. |
compilation fails. |
| 6531 |
|
|
| 6532 |
A program can tell if JIT support is available by calling pcre_config() |
A program that is linked with PCRE 8.20 or later can tell if JIT sup- |
| 6533 |
with the PCRE_CONFIG_JIT option. The result is 1 when JIT is available, |
port is available by calling pcre_config() with the PCRE_CONFIG_JIT |
| 6534 |
and 0 otherwise. However, a simple program does not need to check this |
option. The result is 1 when JIT is available, and 0 otherwise. How- |
| 6535 |
in order to use JIT. The API is implemented in a way that falls back to |
ever, a simple program does not need to check this in order to use JIT. |
| 6536 |
the ordinary PCRE code if JIT is not available. |
The API is implemented in a way that falls back to the ordinary PCRE |
| 6537 |
|
code if JIT is not available. |
| 6538 |
|
|
| 6539 |
|
If your program may sometimes be linked with versions of PCRE that are |
| 6540 |
|
older than 8.20, but you want to use JIT when it is available, you can |
| 6541 |
|
test the values of PCRE_MAJOR and PCRE_MINOR, or the existence of a JIT |
| 6542 |
|
macro such as PCRE_CONFIG_JIT, for compile-time control of your code. |
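The compile-time test described above might be sketched as follows. This is only an illustration: VERSION_AT_LEAST, HAVE_PCRE_JIT_API and jit_api_compiled_in() are hypothetical names, not part of PCRE. In real code pcre.h would be included first; without it the sketch still compiles and simply reports that no JIT API is present.

```c
/* Hypothetical helper: true when version maj.min is at least wmaj.wmin.
   It can be used in #if directives and in ordinary expressions. */
#define VERSION_AT_LEAST(maj, min, wmaj, wmin) \
    ((maj) > (wmaj) || ((maj) == (wmaj) && (min) >= (wmin)))

/* In real code, #include <pcre.h> above this point so that PCRE_MAJOR
   and PCRE_MINOR are defined. If they are not defined, the inner test
   is skipped and the fallback value 0 is used. */
#if defined(PCRE_MAJOR) && defined(PCRE_MINOR)
#  if VERSION_AT_LEAST(PCRE_MAJOR, PCRE_MINOR, 8, 20)
#    define HAVE_PCRE_JIT_API 1
#  endif
#endif
#ifndef HAVE_PCRE_JIT_API
#  define HAVE_PCRE_JIT_API 0
#endif

/* Returns 1 if the headers seen at compile time are from 8.20 or later,
   0 otherwise. This says nothing about whether JIT was built into the
   library; that is what pcre_config() reports at run time. */
static int jit_api_compiled_in(void)
{
    return HAVE_PCRE_JIT_API;
}
```

Code that calls JIT-specific functions can then be wrapped in #if HAVE_PCRE_JIT_API, so that the same source builds against older releases.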
| 6543 |
|
|
| 6544 |
|
|
| 6545 |
SIMPLE USE OF JIT |
SIMPLE USE OF JIT |
| 6555 |
no longer needed instead of just freeing it yourself. This |
no longer needed instead of just freeing it yourself. This |
| 6556 |
ensures that any JIT data is also freed. |
ensures that any JIT data is also freed. |
| 6557 |
|
|
| 6558 |
|
For a program that may be linked with pre-8.20 versions of PCRE, you |
| 6559 |
|
can insert |
| 6560 |
|
|
| 6561 |
|
#ifndef PCRE_STUDY_JIT_COMPILE |
| 6562 |
|
#define PCRE_STUDY_JIT_COMPILE 0 |
| 6563 |
|
#endif |
| 6564 |
|
|
| 6565 |
|
so that no option is passed to pcre_study(), and then use something |
| 6566 |
|
like this to free the study data: |
| 6567 |
|
|
| 6568 |
|
#ifdef PCRE_CONFIG_JIT |
| 6569 |
|
pcre_free_study(study_ptr); |
| 6570 |
|
#else |
| 6571 |
|
pcre_free(study_ptr); |
| 6572 |
|
#endif |
| 6573 |
|
|
| 6574 |
In some circumstances you may need to call additional functions. These |
In some circumstances you may need to call additional functions. These |
| 6575 |
are described in the section entitled "Controlling the JIT stack" |
are described in the section entitled "Controlling the JIT stack" |
| 6576 |
below. |
below. |
| 6609 |
|
|
| 6610 |
The unsupported pattern items are: |
The unsupported pattern items are: |
| 6611 |
|
|
| 6612 |
\C match a single byte; not supported in UTF-8 mode |
\C match a single byte; not supported in UTF-8 mode |
| 6613 |
(?Cn) callouts |
(?Cn) callouts |
|
(?(<name>)... conditional test on setting of a named subpattern |
|
|
(?(R)... conditional test on whole pattern recursion |
|
|
(?(Rn)... conditional test on recursion, by number |
|
|
(?(R&name)... conditional test on recursion, by name |
|
| 6614 |
(*COMMIT) ) |
(*COMMIT) ) |
| 6615 |
(*MARK) ) |
(*MARK) ) |
| 6616 |
(*PRUNE) ) the backtracking control verbs |
(*PRUNE) ) the backtracking control verbs |
| 6659 |
large or complicated patterns need more than this. The error |
large or complicated patterns need more than this. The error |
| 6660 |
PCRE_ERROR_JIT_STACKLIMIT is given when there is not enough stack. |
PCRE_ERROR_JIT_STACKLIMIT is given when there is not enough stack. |
| 6661 |
Three functions are provided for managing blocks of memory for use as |
Three functions are provided for managing blocks of memory for use as |
| 6662 |
JIT stacks. |
JIT stacks. There is further discussion about the use of JIT stacks in |
| 6663 |
|
the section entitled "JIT stack FAQ" below. |
| 6664 |
|
|
| 6665 |
The pcre_jit_stack_alloc() function creates a JIT stack. Its arguments |
The pcre_jit_stack_alloc() function creates a JIT stack. Its arguments |
| 6666 |
are a starting size and a maximum size, and it returns a pointer to an |
are a starting size and a maximum size, and it returns a pointer to an |
| 6667 |
opaque structure of type pcre_jit_stack, or NULL if there is an error. |
opaque structure of type pcre_jit_stack, or NULL if there is an error. |
| 6668 |
The pcre_jit_stack_free() function can be used to free a stack that is |
The pcre_jit_stack_free() function can be used to free a stack that is |
| 6669 |
no longer needed. (For the technically minded: the address space is |
no longer needed. (For the technically minded: the address space is |
| 6670 |
allocated by mmap or VirtualAlloc.) |
allocated by mmap or VirtualAlloc.) |
| 6671 |
|
|
| 6672 |
JIT uses far less memory for recursion than the interpretive code, and |
JIT uses far less memory for recursion than the interpretive code, and |
| 6673 |
a maximum stack size of 512K to 1M should be more than enough for any |
a maximum stack size of 512K to 1M should be more than enough for any |
| 6674 |
pattern. |
pattern. |
| 6675 |
|
|
| 6676 |
The pcre_assign_jit_stack() function specifies which stack JIT code |
The pcre_assign_jit_stack() function specifies which stack JIT code |
| 6677 |
should use. Its arguments are as follows: |
should use. Its arguments are as follows: |
| 6678 |
|
|
| 6679 |
pcre_extra *extra |
pcre_extra *extra |
| 6680 |
pcre_jit_callback callback |
pcre_jit_callback callback |
| 6681 |
void *data |
void *data |
| 6682 |
|
|
| 6683 |
The extra argument must be the result of studying a pattern with |
The extra argument must be the result of studying a pattern with |
| 6684 |
PCRE_STUDY_JIT_COMPILE. There are three cases for the values of the |
PCRE_STUDY_JIT_COMPILE. There are three cases for the values of the |
| 6685 |
other two options: |
other two options: |
| 6686 |
|
|
| 6687 |
(1) If callback is NULL and data is NULL, an internal 32K block |
(1) If callback is NULL and data is NULL, an internal 32K block |
| 6696 |
is used; otherwise the return value must be a valid JIT stack, |
is used; otherwise the return value must be a valid JIT stack, |
| 6697 |
the result of calling pcre_jit_stack_alloc(). |
the result of calling pcre_jit_stack_alloc(). |
| 6698 |
|
|
| 6699 |
You may safely assign the same JIT stack to more than one pattern, as |
You may safely assign the same JIT stack to more than one pattern, as |
| 6700 |
long as they are all matched sequentially in the same thread. In a mul- |
long as they are all matched sequentially in the same thread. In a mul- |
| 6701 |
tithread application, each thread must use its own JIT stack. |
tithread application, each thread must use its own JIT stack. |
| 6702 |
|
|
| 6703 |
Strictly speaking, even more is allowed. You can assign the same stack |
Strictly speaking, even more is allowed. You can assign the same stack |
| 6704 |
to any number of patterns as long as they are not used for matching by |
to any number of patterns as long as they are not used for matching by |
| 6705 |
multiple threads at the same time. For example, you can assign the same |
multiple threads at the same time. For example, you can assign the same |
| 6706 |
stack to all compiled patterns, and use a global mutex in the callback |
stack to all compiled patterns, and use a global mutex in the callback |
| 6707 |
to wait until the stack is available for use. However, this is an inef- |
to wait until the stack is available for use. However, this is an inef- |
| 6708 |
ficient solution, and not recommended. |
ficient solution, and not recommended. |
| 6709 |
|
|
| 6710 |
This is a suggestion for how a typical multithreaded program might |
This is a suggestion for how a typical multithreaded program might |
| 6711 |
operate: |
operate: |
| 6712 |
|
|
| 6713 |
During thread initialization |
During thread initialization |
| 6719 |
Use a one-line callback function |
Use a one-line callback function |
| 6720 |
return thread_local_var |
return thread_local_var |
| 6721 |
|
|
| 6722 |
All the functions described in this section do nothing if JIT is not |
All the functions described in this section do nothing if JIT is not |
| 6723 |
available, and pcre_assign_jit_stack() does nothing unless the extra |
available, and pcre_assign_jit_stack() does nothing unless the extra |
| 6724 |
argument is non-NULL and points to a pcre_extra block that is the |
argument is non-NULL and points to a pcre_extra block that is the |
| 6725 |
result of a successful study with PCRE_STUDY_JIT_COMPILE. |
result of a successful study with PCRE_STUDY_JIT_COMPILE. |
| 6726 |
|
|
| 6727 |
|
|
| 6728 |
|
JIT STACK FAQ |
| 6729 |
|
|
| 6730 |
|
(1) Why do we need JIT stacks? |
| 6731 |
|
|
| 6732 |
|
PCRE (and JIT) is a recursive, depth-first engine, so it needs a stack |
| 6733 |
|
where the local data of the current node is pushed before checking its |
| 6734 |
|
child nodes. Allocating real machine stack on some platforms is diffi- |
| 6735 |
|
cult. For example, the stack chain needs to be updated every time if we |
| 6736 |
|
extend the stack on PowerPC. Although it is possible, its updating |
| 6737 |
|
time overhead decreases performance. So we do the recursion in memory. |
| 6738 |
|
|
| 6739 |
|
(2) Why don't we simply allocate blocks of memory with malloc()? |
| 6740 |
|
|
| 6741 |
|
Modern operating systems have a nice feature: they can reserve an |
| 6742 |
|
address space instead of allocating memory. We can safely allocate mem- |
| 6743 |
|
ory pages inside this address space, so the stack could grow without |
| 6744 |
|
moving memory data (this is important because of pointers). Thus we can |
| 6745 |
|
allocate 1M address space, and use only a single memory page (usually |
| 6746 |
|
4K) if that is enough. However, we can still grow up to 1M anytime if |
| 6747 |
|
needed. |
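The reserve-then-commit idea in this answer can be sketched with POSIX calls. This is a minimal illustration under stated assumptions, not PCRE's actual implementation: the demo_stack type and the demo_stack_* functions are hypothetical, and the sketch assumes a system that provides mmap(), mprotect() and MAP_ANONYMOUS.

```c
#define _DEFAULT_SOURCE          /* for MAP_ANONYMOUS on glibc */
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical growable stack: a reserved address range of which only
   the first `committed` bytes are readable and writable. */
typedef struct {
    unsigned char *base;   /* start of the reserved range */
    size_t reserved;       /* total reserved bytes (the maximum size) */
    size_t committed;      /* bytes currently usable */
} demo_stack;

/* Round n up to a multiple of the system page size. */
static size_t round_to_page(size_t n)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    return (n + page - 1) / page * page;
}

/* Reserve max_size bytes of address space and make start_size usable.
   Returns 0 on success, -1 on failure. */
static int demo_stack_init(demo_stack *s, size_t start_size, size_t max_size)
{
    void *p;

    start_size = round_to_page(start_size);
    max_size = round_to_page(max_size);
    if (start_size > max_size) return -1;

    /* PROT_NONE: the range is reserved but not yet accessible. */
    p = mmap(NULL, max_size, PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) return -1;

    if (start_size > 0 &&
        mprotect(p, start_size, PROT_READ | PROT_WRITE) != 0) {
        munmap(p, max_size);
        return -1;
    }
    s->base = p;
    s->reserved = max_size;
    s->committed = start_size;
    return 0;
}

/* Make the first new_size bytes usable. The base address never changes,
   so pointers into the stack stay valid. Returns 0 or -1. */
static int demo_stack_grow(demo_stack *s, size_t new_size)
{
    new_size = round_to_page(new_size);
    if (new_size > s->reserved) return -1;      /* beyond the maximum */
    if (new_size <= s->committed) return 0;     /* already large enough */
    if (mprotect(s->base + s->committed, new_size - s->committed,
                 PROT_READ | PROT_WRITE) != 0) return -1;
    s->committed = new_size;
    return 0;
}

/* Release the whole reserved range. */
static void demo_stack_free(demo_stack *s)
{
    munmap(s->base, s->reserved);
    s->base = NULL;
    s->reserved = s->committed = 0;
}
```

A caller might reserve 1M with demo_stack_init(&s, 4096, 1024 * 1024) and later call demo_stack_grow(&s, 64 * 1024); data already written at s.base stays at the same address, which a realloc()-style copy could not guarantee.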
| 6748 |
|
|
| 6749 |
|
(3) Who "owns" a JIT stack? |
| 6750 |
|
|
| 6751 |
|
The owner of the stack is the user program, not the JIT studied pattern |
| 6752 |
|
or anything else. The user program must ensure that if a stack is used |
| 6753 |
|
by pcre_exec(), (that is, it is assigned to the pattern currently run- |
| 6754 |
|
ning), that stack must not be used by any other threads (to avoid over- |
| 6755 |
|
writing the same memory area). The best practice for multithreaded pro- |
| 6756 |
|
grams is to allocate a stack for each thread, and return this stack |
| 6757 |
|
through the JIT callback function. |
| 6758 |
|
|
| 6759 |
|
(4) When should a JIT stack be freed? |
| 6760 |
|
|
| 6761 |
|
You can free a JIT stack at any time, as long as it will not be used by |
| 6762 |
|
pcre_exec() again. When you assign the stack to a pattern, only a |
| 6763 |
|
pointer is set. There is no reference counting or any other magic. You |
| 6764 |
|
can free the patterns and stacks in any order, anytime. Just do not |
| 6765 |
|
call pcre_exec() with a pattern pointing to an already freed stack, as |
| 6766 |
|
that will cause SEGFAULT. (Also, do not free a stack currently used by |
| 6767 |
|
pcre_exec() in another thread.) You can also replace the stack for a |
| 6768 |
|
pattern at any time. You can even free the previous stack before |
| 6769 |
|
assigning a replacement. |
| 6770 |
|
|
| 6771 |
|
(5) Should I allocate/free a stack every time before/after calling |
| 6772 |
|
pcre_exec()? |
| 6773 |
|
|
| 6774 |
|
No, because this is too costly in terms of resources. However, you |
| 6775 |
|
could implement some clever idea which releases the stack if it is not |
| 6776 |
|
used in, let's say, two minutes. The JIT callback can help to achieve this |
| 6777 |
|
without keeping a list of the currently JIT studied patterns. |
| 6778 |
|
|
| 6779 |
|
(6) OK, the stack is for long term memory allocation. But what happens |
| 6780 |
|
if a pattern causes stack overflow with a stack of 1M? Is that 1M kept |
| 6781 |
|
until the stack is freed? |
| 6782 |
|
|
| 6783 |
|
Especially on embedded systems, it might be a good idea to release |
| 6784 |

memory sometimes without freeing the stack. There is no API for this at |
| 6785 |
|
the moment. Probably a function call which returns with the currently |
| 6786 |
|
allocated memory for any stack and another which allows releasing mem- |
| 6787 |
|
ory (shrinking the stack) would be a good idea if someone needs this. |
| 6788 |
|
|
| 6789 |
|
(7) This is too much of a headache. Isn't there any better solution for |
| 6790 |
|
JIT stack handling? |
| 6791 |
|
|
| 6792 |
|
No, thanks to Windows. If POSIX threads were used everywhere, we could |
| 6793 |
|
throw out this complicated API. |
| 6794 |
|
|
| 6795 |
|
|
| 6796 |
EXAMPLE CODE |
EXAMPLE CODE |
| 6797 |
|
|
| 6798 |
This is a single-threaded example that specifies a JIT stack without |
This is a single-threaded example that specifies a JIT stack without |
| 6824 |
|
|
| 6825 |
AUTHOR |
AUTHOR |
| 6826 |
|
|
| 6827 |
Philip Hazel |
Philip Hazel (FAQ by Zoltan Herczeg) |
| 6828 |
University Computing Service |
University Computing Service |
| 6829 |
Cambridge CB2 3QH, England. |
Cambridge CB2 3QH, England. |
| 6830 |
|
|
| 6831 |
|
|
| 6832 |
REVISION |
REVISION |
| 6833 |
|
|
| 6834 |
Last updated: 19 October 2011 |
Last updated: 26 November 2011 |
| 6835 |
Copyright (c) 1997-2011 University of Cambridge. |
Copyright (c) 1997-2011 University of Cambridge. |
| 6836 |
------------------------------------------------------------------------------ |
------------------------------------------------------------------------------ |
| 6837 |
|
|
| 8272 |
There is no limit to the number of parenthesized subpatterns, but there |
There is no limit to the number of parenthesized subpatterns, but there |
| 8273 |
can be no more than 65535 capturing subpatterns. |
can be no more than 65535 capturing subpatterns. |
| 8274 |
|
|
| 8275 |
|
There is a limit to the number of forward references to subsequent sub- |
| 8276 |
|
patterns of around 200,000. Repeated forward references with fixed |
| 8277 |
|
upper limits, for example, (?2){0,100} when subpattern number 2 is to |
| 8278 |
|
the right, are included in the count. There is no limit to the number |
| 8279 |
|
of backward references. |
| 8280 |
|
|
| 8281 |
The maximum length of name for a named subpattern is 32 characters, and |
The maximum length of name for a named subpattern is 32 characters, and |
| 8282 |
the maximum number of named subpatterns is 10000. |
the maximum number of named subpatterns is 10000. |
| 8283 |
|
|
| 8298 |
|
|
| 8299 |
REVISION |
REVISION |
| 8300 |
|
|
| 8301 |
Last updated: 24 August 2011 |
Last updated: 30 November 2011 |
| 8302 |
Copyright (c) 1997-2011 University of Cambridge. |
Copyright (c) 1997-2011 University of Cambridge. |
| 8303 |
------------------------------------------------------------------------------ |
------------------------------------------------------------------------------ |
| 8304 |
|
|