| 138 |
Last updated: 10 January 2012 |
Last updated: 10 January 2012 |
| 139 |
Copyright (c) 1997-2012 University of Cambridge. |
Copyright (c) 1997-2012 University of Cambridge. |
| 140 |
------------------------------------------------------------------------------ |
------------------------------------------------------------------------------ |
| 141 |
|
|
| 142 |
|
|
| 143 |
PCRE(3) PCRE(3) |
PCRE(3) PCRE(3) |
| 144 |
|
|
| 145 |
|
|
| 464 |
Last updated: 14 April 2012 |
Last updated: 14 April 2012 |
| 465 |
Copyright (c) 1997-2012 University of Cambridge. |
Copyright (c) 1997-2012 University of Cambridge. |
| 466 |
------------------------------------------------------------------------------ |
------------------------------------------------------------------------------ |
| 467 |
|
|
| 468 |
|
|
| 469 |
PCREBUILD(3) PCREBUILD(3) |
PCREBUILD(3) PCREBUILD(3) |
| 470 |
|
|
| 471 |
|
|
| 568 |
tern compiling functions. |
tern compiling functions. |
| 569 |
|
|
| 570 |
If you set --enable-utf when compiling in an EBCDIC environment, PCRE |
If you set --enable-utf when compiling in an EBCDIC environment, PCRE |
| 571 |
expects its input to be either ASCII or UTF-8 (depending on the runtime |
expects its input to be either ASCII or UTF-8 (depending on the run- |
| 572 |
option). It is not possible to support both EBCDIC and UTF-8 codes in |
time option). It is not possible to support both EBCDIC and UTF-8 codes |
| 573 |
the same version of the library. Consequently, --enable-utf and |
in the same version of the library. Consequently, --enable-utf and |
| 574 |
--enable-ebcdic are mutually exclusive. |
--enable-ebcdic are mutually exclusive. |
| 575 |
|
|
| 576 |
|
|
| 761 |
to the configure command, the distributed tables are no longer used. |
to the configure command, the distributed tables are no longer used. |
| 762 |
Instead, a program called dftables is compiled and run. This outputs |
Instead, a program called dftables is compiled and run. This outputs |
| 763 |
the source for a new set of tables, created in the default locale of your |
the source for a new set of tables, created in the default locale of your |
| 764 |
C runtime system. (This method of replacing the tables does not work if |
C run-time system. (This method of replacing the tables does not work |
| 765 |
you are cross compiling, because dftables is run on the local host. If |
if you are cross compiling, because dftables is run on the local host. |
| 766 |
you need to create alternative tables when cross compiling, you will |
If you need to create alternative tables when cross compiling, you will |
| 767 |
have to do so "by hand".) |
have to do so "by hand".) |
| 768 |
|
|
| 769 |
|
|
| 860 |
Last updated: 07 January 2012 |
Last updated: 07 January 2012 |
| 861 |
Copyright (c) 1997-2012 University of Cambridge. |
Copyright (c) 1997-2012 University of Cambridge. |
| 862 |
------------------------------------------------------------------------------ |
------------------------------------------------------------------------------ |
| 863 |
|
|
| 864 |
|
|
| 865 |
PCREMATCHING(3) PCREMATCHING(3) |
PCREMATCHING(3) PCREMATCHING(3) |
| 866 |
|
|
| 867 |
|
|
| 1067 |
Last updated: 08 January 2012 |
Last updated: 08 January 2012 |
| 1068 |
Copyright (c) 1997-2012 University of Cambridge. |
Copyright (c) 1997-2012 University of Cambridge. |
| 1069 |
------------------------------------------------------------------------------ |
------------------------------------------------------------------------------ |
| 1070 |
|
|
| 1071 |
|
|
| 1072 |
PCREAPI(3) PCREAPI(3) |
PCREAPI(3) PCREAPI(3) |
| 1073 |
|
|
| 1074 |
|
|
| 1311 |
feed) character, the two-character sequence CRLF, any of the three pre- |
feed) character, the two-character sequence CRLF, any of the three pre- |
| 1312 |
ceding, or any Unicode newline sequence. The Unicode newline sequences |
ceding, or any Unicode newline sequence. The Unicode newline sequences |
| 1313 |
are the three just mentioned, plus the single characters VT (vertical |
are the three just mentioned, plus the single characters VT (vertical |
| 1314 |
tab, U+000B), FF (formfeed, U+000C), NEL (next line, U+0085), LS (line |
tab, U+000B), FF (form feed, U+000C), NEL (next line, U+0085), LS (line |
| 1315 |
separator, U+2028), and PS (paragraph separator, U+2029). |
separator, U+2028), and PS (paragraph separator, U+2029). |
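As a rough illustration of the "any Unicode newline sequence" convention described above, the following Python sketch reports how long the newline sequence at a given position is. It does not call PCRE. The name newline_length is hypothetical, and the subject is assumed to be a string of code points rather than raw bytes.

```python
# Single-character Unicode newlines listed in the text:
# LF, VT, FF, CR, NEL, LS, PS.
SINGLE_NEWLINES = {"\n", "\x0b", "\x0c", "\r", "\x85", "\u2028", "\u2029"}


def newline_length(subject: str, pos: int) -> int:
    """Return 2 for CRLF, 1 for a single newline character, 0 otherwise."""
    # CRLF is checked first so that it counts as one two-character unit.
    if subject.startswith("\r\n", pos):
        return 2
    if pos < len(subject) and subject[pos] in SINGLE_NEWLINES:
        return 1
    return 0
```

Checking CRLF before the single characters is what keeps the pair from being counted as two separate newlines.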
| 1316 |
|
|
| 1317 |
Each of the first three conventions is used by at least one operating |
Each of the first three conventions is used by at least one operating |
| 1625 |
|
|
| 1626 |
PCRE_EXTENDED |
PCRE_EXTENDED |
| 1627 |
|
|
| 1628 |
If this bit is set, whitespace data characters in the pattern are |
If this bit is set, white space data characters in the pattern are |
| 1629 |
totally ignored except when escaped or inside a character class. White- |
totally ignored except when escaped or inside a character class. White |
| 1630 |
space does not include the VT character (code 11). In addition, charac- |
space does not include the VT character (code 11). In addition, charac- |
| 1631 |
ters between an unescaped # outside a character class and the next new- |
ters between an unescaped # outside a character class and the next new- |
| 1632 |
line, inclusive, are also ignored. This is equivalent to Perl's /x |
line, inclusive, are also ignored. This is equivalent to Perl's /x |
| 1642 |
|
|
| 1643 |
This option makes it possible to include comments inside complicated |
This option makes it possible to include comments inside complicated |
| 1644 |
patterns. Note, however, that this applies only to data characters. |
patterns. Note, however, that this applies only to data characters. |
| 1645 |
Whitespace characters may never appear within special character |
White space characters may never appear within special character |
| 1646 |
sequences in a pattern, for example within the sequence (?( that intro- |
sequences in a pattern, for example within the sequence (?( that intro- |
| 1647 |
duces a conditional subpattern. |
duces a conditional subpattern. |
| 1648 |
|
|
| 1727 |
that any of the three preceding sequences should be recognized. Setting |
that any of the three preceding sequences should be recognized. Setting |
| 1728 |
PCRE_NEWLINE_ANY specifies that any Unicode newline sequence should be |
PCRE_NEWLINE_ANY specifies that any Unicode newline sequence should be |
| 1729 |
recognized. The Unicode newline sequences are the three just mentioned, |
recognized. The Unicode newline sequences are the three just mentioned, |
| 1730 |
plus the single characters VT (vertical tab, U+000B), FF (formfeed, |
plus the single characters VT (vertical tab, U+000B), FF (form feed, |
| 1731 |
U+000C), NEL (next line, U+0085), LS (line separator, U+2028), and PS |
U+000C), NEL (next line, U+0085), LS (line separator, U+2028), and PS |
| 1732 |
(paragraph separator, U+2029). For the 8-bit library, the last two are |
(paragraph separator, U+2029). For the 8-bit library, the last two are |
| 1733 |
recognized only in UTF-8 mode. |
recognized only in UTF-8 mode. |
| 1741 |
cause an error. |
cause an error. |
| 1742 |
|
|
| 1743 |
The only time that a line break in a pattern is specially recognized |
The only time that a line break in a pattern is specially recognized |
| 1744 |
when compiling is when PCRE_EXTENDED is set. CR and LF are whitespace |
when compiling is when PCRE_EXTENDED is set. CR and LF are white space |
| 1745 |
characters, and so are ignored in this mode. Also, an unescaped # out- |
characters, and so are ignored in this mode. Also, an unescaped # out- |
| 1746 |
side a character class indicates a comment that lasts until after the |
side a character class indicates a comment that lasts until after the |
| 1747 |
next line break sequence. In other circumstances, line break sequences |
next line break sequence. In other circumstances, line break sequences |
| 1894 |
72 too many forward references |
72 too many forward references |
| 1895 |
73 disallowed Unicode code point (>= 0xd800 && <= 0xdfff) |
73 disallowed Unicode code point (>= 0xd800 && <= 0xdfff) |
| 1896 |
74 invalid UTF-16 string (specifically UTF-16) |
74 invalid UTF-16 string (specifically UTF-16) |
| 1897 |
|
75 name is too long in (*MARK), (*PRUNE), (*SKIP), or (*THEN) |
| 1898 |
|
|
| 1899 |
The numbers 32 and 10000 in errors 48 and 49 are defaults; different |
The numbers 32 and 10000 in errors 48 and 49 are defaults; different |
| 1900 |
values may be used if the limits were changed when PCRE was built. |
values may be used if the limits were changed when PCRE was built. |
| 2994 |
for the just-in-time processing stack is not large enough. See the |
for the just-in-time processing stack is not large enough. See the |
| 2995 |
pcrejit documentation for more details. |
pcrejit documentation for more details. |
| 2996 |
|
|
| 2997 |
PCRE_ERROR_BADMODE (-28) |
PCRE_ERROR_BADMODE (-28) |
| 2998 |
|
|
| 2999 |
This error is given if a pattern that was compiled by the 8-bit library |
This error is given if a pattern that was compiled by the 8-bit library |
| 3000 |
is passed to a 16-bit library function, or vice versa. |
is passed to a 16-bit library function, or vice versa. |
| 3001 |
|
|
| 3002 |
PCRE_ERROR_BADENDIANNESS (-29) |
PCRE_ERROR_BADENDIANNESS (-29) |
| 3003 |
|
|
| 3004 |
This error is given if a pattern that was compiled and saved is |
This error is given if a pattern that was compiled and saved is |
| 3005 |
reloaded on a host with different endianness. The utility function |
reloaded on a host with different endianness. The utility function |
| 3006 |
pcre_pattern_to_host_byte_order() can be used to convert such a pattern |
pcre_pattern_to_host_byte_order() can be used to convert such a pattern |
| 3007 |
so that it runs on the new host. |
so that it runs on the new host. |
| 3008 |
|
|
| 3009 |
Error numbers -16 to -20 and -22 are not used by pcre_exec(). |
Error numbers -16 to -20, -22, and -30 are not used by pcre_exec(). |
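The error numbers quoted in this section can be collected into a small lookup, sketched here in Python purely for illustration. The numeric values come from the text; the names ERROR_NAMES, NOT_USED_BY_PCRE_EXEC and describe are hypothetical, and real C code would take the constants from pcre.h instead.

```python
# Values as quoted in the documentation text.
PCRE_ERROR_BADMODE = -28
PCRE_ERROR_BADENDIANNESS = -29
PCRE_ERROR_DFA_BADRESTART = -30

ERROR_NAMES = {
    PCRE_ERROR_BADMODE: "PCRE_ERROR_BADMODE",
    PCRE_ERROR_BADENDIANNESS: "PCRE_ERROR_BADENDIANNESS",
    PCRE_ERROR_DFA_BADRESTART: "PCRE_ERROR_DFA_BADRESTART",
}

# "Error numbers -16 to -20, -22, and -30 are not used by pcre_exec()."
NOT_USED_BY_PCRE_EXEC = set(range(-20, -15)) | {-22, -30}


def describe(code: int) -> str:
    """Return a name for the codes listed here, or a generic label."""
    return ERROR_NAMES.get(code, "error %d" % code)
```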
| 3010 |
|
|
| 3011 |
Reason codes for invalid UTF-8 strings |
Reason codes for invalid UTF-8 strings |
| 3012 |
|
|
| 3469 |
This error is given if the output vector is not large enough. This |
This error is given if the output vector is not large enough. This |
| 3470 |
should be extremely rare, as a vector of size 1000 is used. |
should be extremely rare, as a vector of size 1000 is used. |
| 3471 |
|
|
| 3472 |
|
PCRE_ERROR_DFA_BADRESTART (-30) |
| 3473 |
|
|
| 3474 |
|
When pcre_dfa_exec() is called with the PCRE_DFA_RESTART option, some |
| 3475 |
|
plausibility checks are made on the contents of the workspace, which |
| 3476 |
|
should contain data about the previous partial match. If any of these |
| 3477 |
|
checks fail, this error is given. |
| 3478 |
|
|
| 3479 |
|
|
| 3480 |
SEE ALSO |
SEE ALSO |
| 3481 |
|
|
| 3482 |
pcre16(3), pcrebuild(3), pcrecallout(3), pcrecpp(3), pcrematching(3), |
pcre16(3), pcrebuild(3), pcrecallout(3), pcrecpp(3), pcrematching(3), |
| 3483 |
pcrepartial(3), pcreposix(3), pcreprecompile(3), pcresample(3), |
pcrepartial(3), pcreposix(3), pcreprecompile(3), pcresample(3), |
| 3484 |
pcrestack(3). |
pcrestack(3). |
| 3485 |
|
|
| 3493 |
|
|
| 3494 |
REVISION |
REVISION |
| 3495 |
|
|
| 3496 |
Last updated: 14 April 2012 |
Last updated: 04 May 2012 |
| 3497 |
Copyright (c) 1997-2012 University of Cambridge. |
Copyright (c) 1997-2012 University of Cambridge. |
| 3498 |
------------------------------------------------------------------------------ |
------------------------------------------------------------------------------ |
| 3499 |
|
|
| 3500 |
|
|
| 3501 |
PCRECALLOUT(3) PCRECALLOUT(3) |
PCRECALLOUT(3) PCRECALLOUT(3) |
| 3502 |
|
|
| 3503 |
|
|
| 3695 |
Last updated: 08 January 2012 |
Last updated: 08 January 2012 |
| 3696 |
Copyright (c) 1997-2012 University of Cambridge. |
Copyright (c) 1997-2012 University of Cambridge. |
| 3697 |
------------------------------------------------------------------------------ |
------------------------------------------------------------------------------ |
| 3698 |
|
|
| 3699 |
|
|
| 3700 |
PCRECOMPAT(3) PCRECOMPAT(3) |
PCRECOMPAT(3) PCRECOMPAT(3) |
| 3701 |
|
|
| 3702 |
|
|
| 3785 |
There is a discussion that explains these differences in more detail in |
There is a discussion that explains these differences in more detail in |
| 3786 |
the section on recursion differences from Perl in the pcrepattern page. |
the section on recursion differences from Perl in the pcrepattern page. |
| 3787 |
|
|
| 3788 |
11. If (*THEN) is present in a group that is called as a subroutine, |
11. If any of the backtracking control verbs are used in an assertion |
| 3789 |
its action is limited to that group, even if the group does not contain |
or in a subpattern that is called as a subroutine (whether or not |
| 3790 |
any | characters. |
recursively), their effect is confined to that subpattern; it does not |
| 3791 |
|
extend to the surrounding pattern. This is not always the case in Perl. |
| 3792 |
|
In particular, if (*THEN) is present in a group that is called as a |
| 3793 |
|
subroutine, its action is limited to that group, even if the group does |
| 3794 |
|
not contain any | characters. There is one exception to this: the name |
| 3795 |
|
from a (*MARK), (*PRUNE), or (*THEN) that is encountered in a |
| 3796 |

successful positive assertion is passed back when a match succeeds (compare |
| 3797 |
|
capturing parentheses in assertions). Note that such subpatterns are |
| 3798 |
|
processed as anchored at the point where they are tested. |
| 3799 |
|
|
| 3800 |
12. There are some differences that are concerned with the settings of |
12. There are some differences that are concerned with the settings of |
| 3801 |
captured strings when part of a pattern is repeated. For example, |
captured strings when part of a pattern is repeated. For example, |
| 3815 |
|
|
| 3816 |
14. Perl recognizes comments in some places that PCRE does not, for |
14. Perl recognizes comments in some places that PCRE does not, for |
| 3817 |
example, between the ( and ? at the start of a subpattern. If the /x |
example, between the ( and ? at the start of a subpattern. If the /x |
| 3818 |
modifier is set, Perl allows whitespace between ( and ? but PCRE never |
modifier is set, Perl allows white space between ( and ? but PCRE never |
| 3819 |
does, even if the PCRE_EXTENDED option is set. |
does, even if the PCRE_EXTENDED option is set. |
| 3820 |
|
|
| 3821 |
15. PCRE provides some extensions to the Perl regular expression facil- |
15. PCRE provides some extensions to the Perl regular expression facil- |
| 3875 |
|
|
| 3876 |
REVISION |
REVISION |
| 3877 |
|
|
| 3878 |
Last updated: 08 January 2012 |
Last updated: 01 June 2012 |
| 3879 |
Copyright (c) 1997-2012 University of Cambridge. |
Copyright (c) 1997-2012 University of Cambridge. |
| 3880 |
------------------------------------------------------------------------------ |
------------------------------------------------------------------------------ |
| 3881 |
|
|
| 3882 |
|
|
| 3883 |
PCREPATTERN(3) PCREPATTERN(3) |
PCREPATTERN(3) PCREPATTERN(3) |
| 3884 |
|
|
| 3885 |
|
|
| 4061 |
after a backslash. All other characters (in particular, those whose |
after a backslash. All other characters (in particular, those whose |
| 4062 |
codepoints are greater than 127) are treated as literals. |
codepoints are greater than 127) are treated as literals. |
| 4063 |
|
|
| 4064 |
If a pattern is compiled with the PCRE_EXTENDED option, whitespace in |
If a pattern is compiled with the PCRE_EXTENDED option, white space in |
| 4065 |
the pattern (other than in a character class) and characters between a |
the pattern (other than in a character class) and characters between a |
| 4066 |
# outside a character class and the next newline are ignored. An escap- |
# outside a character class and the next newline are ignored. An escap- |
| 4067 |
ing backslash can be used to include a whitespace or # character as |
ing backslash can be used to include a white space or # character as |
| 4068 |
part of the pattern. |
part of the pattern. |
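To make the PCRE_EXTENDED rule concrete, here is a simplified Python sketch that removes the characters this paragraph says are ignored. It is not PCRE's own code. The name strip_extended is hypothetical, LF is assumed to be the newline convention, VT is left out of the white space set as stated in the pcreapi text, and \Q...\E quoting and POSIX class names such as [:alpha:] are not handled.

```python
def strip_extended(pattern: str) -> str:
    """Drop unescaped white space and # comments outside character classes."""
    whitespace = " \t\n\f\r"  # VT (code 11) is deliberately excluded
    out = []
    i, n = 0, len(pattern)
    in_class = False
    while i < n:
        ch = pattern[i]
        if ch == "\\" and i + 1 < n:
            # An escaped character is always kept, together with its backslash.
            out.append(pattern[i:i + 2])
            i += 2
        elif in_class:
            out.append(ch)
            if ch == "]":
                in_class = False
            i += 1
        elif ch == "[":
            in_class = True
            out.append(ch)
            i += 1
            # A leading ^ and a leading ] belong to the class.
            if i < n and pattern[i] == "^":
                out.append("^")
                i += 1
            if i < n and pattern[i] == "]":
                out.append("]")
                i += 1
        elif ch in whitespace:
            i += 1
        elif ch == "#":
            # Skip the comment up to and including the next newline.
            nl = pattern.find("\n", i)
            i = n if nl == -1 else nl + 1
        else:
            out.append(ch)
            i += 1
    return "".join(out)
```

For instance, strip_extended("a b # c\n d") yields "abd", while an escaped space or a space inside [...] is preserved.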
| 4069 |
|
|
| 4070 |
If you want to remove the special meaning from a sequence of charac- |
If you want to remove the special meaning from a sequence of charac- |
| 4099 |
\a alarm, that is, the BEL character (hex 07) |
\a alarm, that is, the BEL character (hex 07) |
| 4100 |
\cx "control-x", where x is any ASCII character |
\cx "control-x", where x is any ASCII character |
| 4101 |
\e escape (hex 1B) |
\e escape (hex 1B) |
| 4102 |
\f formfeed (hex 0C) |
\f form feed (hex 0C) |
| 4103 |
\n linefeed (hex 0A) |
\n linefeed (hex 0A) |
| 4104 |
\r carriage return (hex 0D) |
\r carriage return (hex 0D) |
| 4105 |
\t tab (hex 09) |
\t tab (hex 09) |
| 4228 |
|
|
| 4229 |
\d any decimal digit |
\d any decimal digit |
| 4230 |
\D any character that is not a decimal digit |
\D any character that is not a decimal digit |
| 4231 |
\h any horizontal whitespace character |
\h any horizontal white space character |
| 4232 |
\H any character that is not a horizontal whitespace character |
\H any character that is not a horizontal white space character |
| 4233 |
\s any whitespace character |
\s any white space character |
| 4234 |
\S any character that is not a whitespace character |
\S any character that is not a white space character |
| 4235 |
\v any vertical whitespace character |
\v any vertical white space character |
| 4236 |
\V any character that is not a vertical whitespace character |
\V any character that is not a vertical white space character |
| 4237 |
\w any "word" character |
\w any "word" character |
| 4238 |
\W any "non-word" character |
\W any "non-word" character |
| 4239 |
|
|
| 4313 |
|
|
| 4314 |
U+000A Linefeed |
U+000A Linefeed |
| 4315 |
U+000B Vertical tab |
U+000B Vertical tab |
| 4316 |
U+000C Formfeed |
U+000C Form feed |
| 4317 |
U+000D Carriage return |
U+000D Carriage return |
| 4318 |
U+0085 Next line |
U+0085 Next line |
| 4319 |
U+2028 Line separator |
U+2028 Line separator |
| 4333 |
This is an example of an "atomic group", details of which are given |
This is an example of an "atomic group", details of which are given |
| 4334 |
below. This particular group matches either the two-character sequence |
below. This particular group matches either the two-character sequence |
| 4335 |
CR followed by LF, or one of the single characters LF (linefeed, |
CR followed by LF, or one of the single characters LF (linefeed, |
| 4336 |
U+000A), VT (vertical tab, U+000B), FF (formfeed, U+000C), CR (carriage |
U+000A), VT (vertical tab, U+000B), FF (form feed, U+000C), CR (car- |
| 4337 |
return, U+000D), or NEL (next line, U+0085). The two-character sequence |
riage return, U+000D), or NEL (next line, U+0085). The two-character |
| 4338 |
is treated as a single unit that cannot be split. |
sequence is treated as a single unit that cannot be split. |
| 4339 |
|
|
| 4340 |
In other modes, two additional characters whose codepoints are greater |
In other modes, two additional characters whose codepoints are greater |
| 4341 |
than 255 are added: LS (line separator, U+2028) and PS (paragraph sepa- |
than 255 are added: LS (line separator, U+2028) and PS (paragraph sepa- |
| 4535 |
|
|
| 4536 |
Xan matches characters that have either the L (letter) or the N (num- |
Xan matches characters that have either the L (letter) or the N (num- |
| 4537 |
ber) property. Xps matches the characters tab, linefeed, vertical tab, |
ber) property. Xps matches the characters tab, linefeed, vertical tab, |
| 4538 |
formfeed, or carriage return, and any other character that has the Z |
form feed, or carriage return, and any other character that has the Z |
| 4539 |
(separator) property. Xsp is the same as Xps, except that vertical tab |
(separator) property. Xsp is the same as Xps, except that vertical tab |
| 4540 |
is excluded. Xwd matches the same characters as Xan, plus underscore. |
is excluded. Xwd matches the same characters as Xan, plus underscore. |
| 4541 |
|
|
| 5500 |
its following a backslash are taken as part of a potential back refer- |
its following a backslash are taken as part of a potential back refer- |
| 5501 |
ence number. If the pattern continues with a digit character, some |
ence number. If the pattern continues with a digit character, some |
| 5502 |
delimiter must be used to terminate the back reference. If the |
delimiter must be used to terminate the back reference. If the |
| 5503 |
PCRE_EXTENDED option is set, this can be whitespace. Otherwise, the \g{ |
PCRE_EXTENDED option is set, this can be white space. Otherwise, the |
| 5504 |
syntax or an empty comment (see "Comments" below) can be used. |
\g{ syntax or an empty comment (see "Comments" below) can be used. |
| 5505 |
|
|
| 5506 |
Recursive back references |
Recursive back references |
| 5507 |
|
|
| 5813 |
DEFINE is that it can be used to define subroutines that can be refer- |
DEFINE is that it can be used to define subroutines that can be refer- |
| 5814 |
enced from elsewhere. (The use of subroutines is described below.) For |
enced from elsewhere. (The use of subroutines is described below.) For |
| 5815 |
example, a pattern to match an IPv4 address such as "192.168.23.245" |
example, a pattern to match an IPv4 address such as "192.168.23.245" |
| 5816 |
could be written like this (ignore whitespace and line breaks): |
could be written like this (ignore white space and line breaks): |
| 5817 |
|
|
| 5818 |
(?(DEFINE) (?<byte> 2[0-4]\d | 25[0-5] | 1\d\d | [1-9]?\d) ) |
(?(DEFINE) (?<byte> 2[0-4]\d | 25[0-5] | 1\d\d | [1-9]?\d) ) |
| 5819 |
\b (?&byte) (\.(?&byte)){3} \b |
\b (?&byte) (\.(?&byte)){3} \b |
| 6204 |
that is encountered in a successful positive assertion is passed back |
that is encountered in a successful positive assertion is passed back |
| 6205 |
when a match succeeds (compare capturing parentheses in assertions). |
when a match succeeds (compare capturing parentheses in assertions). |
| 6206 |
Note that such subpatterns are processed as anchored at the point where |
Note that such subpatterns are processed as anchored at the point where |
| 6207 |
they are tested. Note also that Perl's treatment of subroutines is dif- |
they are tested. Note also that Perl's treatment of subroutines and |
| 6208 |
ferent in some cases. |
assertions is different in some cases. |
| 6209 |
|
|
| 6210 |
The new verbs make use of what was previously invalid syntax: an open- |
The new verbs make use of what was previously invalid syntax: an open- |
| 6211 |
ing parenthesis followed by an asterisk. They are generally of the form |
ing parenthesis followed by an asterisk. They are generally of the form |
| 6212 |
(*VERB) or (*VERB:NAME). Some may take either form, with differing be- |
(*VERB) or (*VERB:NAME). Some may take either form, with differing be- |
| 6213 |
haviour, depending on whether or not an argument is present. A name is |
haviour, depending on whether or not an argument is present. A name is |
| 6214 |
any sequence of characters that does not include a closing parenthesis. |
any sequence of characters that does not include a closing parenthesis. |
| 6215 |
If the name is empty, that is, if the closing parenthesis immediately |
The maximum length of name is 255 in the 8-bit library and 65535 in the |
| 6216 |
follows the colon, the effect is as if the colon were not there. Any |
16-bit library. If the name is empty, that is, if the closing parenthe- |
| 6217 |
number of these verbs may occur in a pattern. |
sis immediately follows the colon, the effect is as if the colon were |
| 6218 |
|
not there. Any number of these verbs may occur in a pattern. |
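The naming rules above can be sketched as a small Python parser. This is only an illustration under stated assumptions, not how PCRE parses verbs: parse_verb and MAX_NAME_UNITS are hypothetical names, the length is assumed to be counted in code units (UTF-8 bytes for the 8-bit library, UTF-16 units for the 16-bit library), and the (*:NAME) shorthand is not handled.

```python
import re

# Limits quoted in the text: 255 (8-bit library), 65535 (16-bit library).
MAX_NAME_UNITS = {8: 255, 16: 65535}

# A name is any sequence of characters without a closing parenthesis.
_VERB = re.compile(r"\(\*([A-Z_]+)(?::([^)]*))?\)")


def parse_verb(text: str, width: int = 8):
    """Split "(*VERB)" or "(*VERB:NAME)" into (verb, name or None)."""
    m = _VERB.fullmatch(text)
    if m is None:
        raise ValueError("not a (*VERB) or (*VERB:NAME) item")
    verb, name = m.group(1), m.group(2)
    if not name:
        # An empty name behaves as if the colon were not there.
        return verb, None
    if width == 8:
        units = len(name.encode("utf-8"))
    else:
        units = len(name.encode("utf-16-le")) // 2
    if units > MAX_NAME_UNITS[width]:
        raise ValueError("name is too long")
    return verb, name
```

With these assumptions, parse_verb("(*THEN:)") and parse_verb("(*THEN)") give the same result, and a 256-character ASCII name is rejected for width 8 but accepted for width 16.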
| 6219 |
|
|
| 6220 |
Optimizations that affect backtracking verbs |
Optimizations that affect backtracking verbs |
| 6221 |
|
|
| 6222 |
PCRE contains some optimizations that are used to speed up matching by |
PCRE contains some optimizations that are used to speed up matching by |
| 6223 |
running some checks at the start of each match attempt. For example, it |
running some checks at the start of each match attempt. For example, it |
| 6224 |
may know the minimum length of matching subject, or that a particular |
may know the minimum length of matching subject, or that a particular |
| 6225 |
character must be present. When one of these optimizations suppresses |
character must be present. When one of these optimizations suppresses |
| 6226 |
the running of a match, any included backtracking verbs will not, of |
the running of a match, any included backtracking verbs will not, of |
| 6227 |
course, be processed. You can suppress the start-of-match optimizations |
course, be processed. You can suppress the start-of-match optimizations |
| 6228 |
by setting the PCRE_NO_START_OPTIMIZE option when calling pcre_com- |
by setting the PCRE_NO_START_OPTIMIZE option when calling pcre_com- |
| 6229 |
pile() or pcre_exec(), or by starting the pattern with (*NO_START_OPT). |
pile() or pcre_exec(), or by starting the pattern with (*NO_START_OPT). |
| 6230 |
There is more discussion of this option in the section entitled "Option |
There is more discussion of this option in the section entitled "Option |
| 6231 |
bits for pcre_exec()" in the pcreapi documentation. |
bits for pcre_exec()" in the pcreapi documentation. |
| 6232 |
|
|
| 6233 |
Experiments with Perl suggest that it too has similar optimizations, |
Experiments with Perl suggest that it too has similar optimizations, |
| 6234 |
sometimes leading to anomalous results. |
sometimes leading to anomalous results. |
| 6235 |
|
|
| 6236 |
Verbs that act immediately |
Verbs that act immediately |
| 6237 |
|
|
| 6238 |
The following verbs act as soon as they are encountered. They may not |
The following verbs act as soon as they are encountered. They may not |
| 6239 |
be followed by a name. |
be followed by a name. |
| 6240 |
|
|
| 6241 |
(*ACCEPT) |
(*ACCEPT) |
| 6242 |
|
|
| 6243 |
This verb causes the match to end successfully, skipping the remainder |
This verb causes the match to end successfully, skipping the remainder |
| 6244 |
of the pattern. However, when it is inside a subpattern that is called |
of the pattern. However, when it is inside a subpattern that is called |
| 6245 |
as a subroutine, only that subpattern is ended successfully. Matching |
as a subroutine, only that subpattern is ended successfully. Matching |
| 6246 |
then continues at the outer level. If (*ACCEPT) is inside capturing |
then continues at the outer level. If (*ACCEPT) is inside capturing |
| 6247 |
parentheses, the data so far is captured. For example: |
parentheses, the data so far is captured. For example: |
| 6248 |
|
|
| 6249 |
A((?:A|B(*ACCEPT)|C)D) |
A((?:A|B(*ACCEPT)|C)D) |
| 6250 |
|
|
| 6251 |
This matches "AB", "AAD", or "ACD"; when it matches "AB", "B" is cap- |
This matches "AB", "AAD", or "ACD"; when it matches "AB", "B" is cap- |
| 6252 |
tured by the outer parentheses. |
tured by the outer parentheses. |
| 6253 |
|
|
| 6254 |
(*FAIL) or (*F) |
(*FAIL) or (*F) |
| 6255 |
|
|
| 6256 |
This verb causes a matching failure, forcing backtracking to occur. It |
This verb causes a matching failure, forcing backtracking to occur. It |
| 6257 |
is equivalent to (?!) but easier to read. The Perl documentation notes |
is equivalent to (?!) but easier to read. The Perl documentation notes |
| 6258 |
that it is probably useful only when combined with (?{}) or (??{}). |
that it is probably useful only when combined with (?{}) or (??{}). |
| 6259 |
Those are, of course, Perl features that are not present in PCRE. The |
Those are, of course, Perl features that are not present in PCRE. The |
| 6260 |
nearest equivalent is the callout feature, as for example in this pat- |
nearest equivalent is the callout feature, as for example in this pat- |
| 6261 |
tern: |
tern: |
| 6262 |
|
|
| 6263 |
a+(?C)(*FAIL) |
a+(?C)(*FAIL) |
| 6264 |
|
|
| 6265 |
A match with the string "aaaa" always fails, but the callout is taken |
A match with the string "aaaa" always fails, but the callout is taken |
| 6266 |
before each backtrack happens (in this example, 10 times). |
before each backtrack happens (in this example, 10 times). |
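The figure of 10 can be reproduced with a small Python simulation of the backtracking, given here as a sketch rather than as PCRE's actual algorithm. It assumes an unanchored match, a greedy a+ that gives back one character per backtrack, and no optimization that skips start positions; count_callouts is a hypothetical name.

```python
def count_callouts(subject: str) -> int:
    """Count how often (?C) is reached for a+(?C)(*FAIL) on the subject."""
    calls = 0
    for start in range(len(subject)):
        # Greedy a+: extend as far as the run of "a" characters goes.
        end = start
        while end < len(subject) and subject[end] == "a":
            end += 1
        # Each length from the longest down to 1 reaches the callout
        # once before (*FAIL) forces the next backtrack.
        for _length in range(end - start, 0, -1):
            calls += 1
    return calls
```

For "aaaa" the start positions contribute 4, 3, 2 and 1 callouts, which adds up to the 10 quoted above.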
| 6267 |
|
|
| 6268 |
Recording which path was taken |
Recording which path was taken |
| 6269 |
|
|
| 6270 |
There is one verb whose main purpose is to track how a match was |
There is one verb whose main purpose is to track how a match was |
| 6271 |
arrived at, though it also has a secondary use in conjunction with |
arrived at, though it also has a secondary use in conjunction with |
| 6272 |
advancing the match starting point (see (*SKIP) below). |
advancing the match starting point (see (*SKIP) below). |
| 6273 |
|
|
| 6274 |
(*MARK:NAME) or (*:NAME) |
(*MARK:NAME) or (*:NAME) |
| 6275 |
|
|
| 6276 |
A name is always required with this verb. There may be as many |
A name is always required with this verb. There may be as many |
| 6277 |
instances of (*MARK) as you like in a pattern, and their names do not |
instances of (*MARK) as you like in a pattern, and their names do not |
| 6278 |
have to be unique. |
have to be unique. |
| 6279 |
|
|
| 6280 |
When a match succeeds, the name of the last-encountered (*MARK) on the |
When a match succeeds, the name of the last-encountered (*MARK) on the |
| 6281 |
matching path is passed back to the caller as described in the section |
matching path is passed back to the caller as described in the section |
| 6282 |
entitled "Extra data for pcre_exec()" in the pcreapi documentation. |
entitled "Extra data for pcre_exec()" in the pcreapi documentation. |
| 6283 |
Here is an example of pcretest output, where the /K modifier requests |
Here is an example of pcretest output, where the /K modifier requests |
| 6284 |
the retrieval and outputting of (*MARK) data: |
the retrieval and outputting of (*MARK) data: |
| 6285 |
|
|
| 6286 |
re> /X(*MARK:A)Y|X(*MARK:B)Z/K |
re> /X(*MARK:A)Y|X(*MARK:B)Z/K |
| 6292 |
MK: B |
MK: B |
| 6293 |
|
|
| 6294 |
The (*MARK) name is tagged with "MK:" in this output, and in this exam- |
The (*MARK) name is tagged with "MK:" in this output, and in this exam- |
| 6295 |
ple it indicates which of the two alternatives matched. This is a more |
ple it indicates which of the two alternatives matched. This is a more |
| 6296 |
efficient way of obtaining this information than putting each alterna- |
efficient way of obtaining this information than putting each alterna- |
| 6297 |
tive in its own capturing parentheses. |
tive in its own capturing parentheses. |
| 6298 |
|
|
| 6299 |
If (*MARK) is encountered in a positive assertion, its name is recorded |
If (*MARK) is encountered in a positive assertion, its name is recorded |
| 6300 |
and passed back if it is the last-encountered. This does not happen for |
and passed back if it is the last-encountered. This does not happen for |
| 6301 |
negative assertions. |
negative assertions. |
| 6302 |
|
|
| 6303 |
After a partial match or a failed match, the name of the last encoun- |
After a partial match or a failed match, the name of the last encoun- |
| 6304 |
tered (*MARK) in the entire match process is returned. For example: |
tered (*MARK) in the entire match process is returned. For example: |
| 6305 |
|
|
| 6306 |
re> /X(*MARK:A)Y|X(*MARK:B)Z/K |
re> /X(*MARK:A)Y|X(*MARK:B)Z/K |
| 6307 |
data> XP |
data> XP |
| 6308 |
No match, mark = B |
No match, mark = B |
| 6309 |
|
|
| 6310 |
Note that in this unanchored example the mark is retained from the |
Note that in this unanchored example the mark is retained from the |
| 6311 |
match attempt that started at the letter "X" in the subject. Subsequent |
match attempt that started at the letter "X" in the subject. Subsequent |
| 6312 |
match attempts starting at "P" and then with an empty string do not get |
match attempts starting at "P" and then with an empty string do not get |
| 6313 |
as far as the (*MARK) item, but nevertheless do not reset it. |
as far as the (*MARK) item, but nevertheless do not reset it. |
| 6314 |
|
|
| 6315 |
If you are interested in (*MARK) values after failed matches, you |
If you are interested in (*MARK) values after failed matches, you |
| 6316 |
should probably set the PCRE_NO_START_OPTIMIZE option (see above) to |
should probably set the PCRE_NO_START_OPTIMIZE option (see above) to |
| 6317 |
ensure that the match is always attempted. |
ensure that the match is always attempted. |
| 6318 |
|
|
| 6319 |
Verbs that act after backtracking |
Verbs that act after backtracking |
| 6320 |
|
|
| 6321 |
The following verbs do nothing when they are encountered. Matching con- |
The following verbs do nothing when they are encountered. Matching con- |
| 6322 |
tinues with what follows, but if there is no subsequent match, causing |
tinues with what follows, but if there is no subsequent match, causing |
| 6323 |
a backtrack to the verb, a failure is forced. That is, backtracking |
a backtrack to the verb, a failure is forced. That is, backtracking |
| 6324 |
cannot pass to the left of the verb. However, when one of these verbs |
cannot pass to the left of the verb. However, when one of these verbs |
| 6325 |
appears inside an atomic group, its effect is confined to that group, |
appears inside an atomic group, its effect is confined to that group, |
| 6326 |
because once the group has been matched, there is never any backtrack- |
because once the group has been matched, there is never any backtrack- |
| 6327 |
ing into it. In this situation, backtracking can "jump back" to the |
ing into it. In this situation, backtracking can "jump back" to the |
| 6328 |
left of the entire atomic group. (Remember also, as stated above, that |
left of the entire atomic group. (Remember also, as stated above, that |
| 6329 |
this localization also applies in subroutine calls and assertions.) |
this localization also applies in subroutine calls and assertions.) |
| 6330 |
|
|
| 6331 |
These verbs differ in exactly what kind of failure occurs when back- |
These verbs differ in exactly what kind of failure occurs when back- |
| 6332 |
tracking reaches them. |
tracking reaches them. |
| 6333 |
|
|
| 6334 |
(*COMMIT) |
(*COMMIT) |
| 6335 |
|
|
| 6336 |
This verb, which may not be followed by a name, causes the whole match |
This verb, which may not be followed by a name, causes the whole match |
| 6337 |
to fail outright if the rest of the pattern does not match. Even if the |
to fail outright if the rest of the pattern does not match. Even if the |
| 6338 |
pattern is unanchored, no further attempts to find a match by advancing |
pattern is unanchored, no further attempts to find a match by advancing |
| 6339 |
the starting point take place. Once (*COMMIT) has been passed, |
the starting point take place. Once (*COMMIT) has been passed, |
| 6340 |
pcre_exec() is committed to finding a match at the current starting |
pcre_exec() is committed to finding a match at the current starting |
| 6341 |
point, or not at all. For example: |
point, or not at all. For example: |
| 6342 |
|
|
| 6343 |
a+(*COMMIT)b |
a+(*COMMIT)b |
| 6344 |
|
|
| 6345 |
This matches "xxaab" but not "aacaab". It can be thought of as a kind |
This matches "xxaab" but not "aacaab". It can be thought of as a kind |
| 6346 |
of dynamic anchor, or "I've started, so I must finish." The name of the |
of dynamic anchor, or "I've started, so I must finish." The name of the |
| 6347 |
most recently passed (*MARK) in the path is passed back when (*COMMIT) |
most recently passed (*MARK) in the path is passed back when (*COMMIT) |
| 6348 |
forces a match failure. |
forces a match failure. |
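
The control flow described above for a+(*COMMIT)b can be sketched in a few
lines of Python. This is a hand-rolled simulation, not a call into PCRE: the
function name match_a_plus_commit_b is hypothetical, and the sketch assumes
that no start-of-match optimization is applied.

```python
def match_a_plus_commit_b(subject):
    """Simulate the unanchored pattern a+(*COMMIT)b.

    Returns the (start, end) span of the match, or None.
    This only mimics the behaviour described in the text; it is not PCRE.
    """
    for start in range(len(subject) + 1):
        # Match a+ greedily from this starting point.
        end = start
        while end < len(subject) and subject[end] == "a":
            end += 1
        if end == start:
            # a+ failed before (*COMMIT) was reached: normal bumpalong.
            continue
        # (*COMMIT) has now been passed at offset `end`.
        if end < len(subject) and subject[end] == "b":
            return (start, end + 1)
        # Backtracking reaches (*COMMIT): the whole match fails outright.
        # a+ is not retried and no further starting points are tried.
        return None
    return None


print(match_a_plus_commit_b("xxaab"))   # (2, 5)
print(match_a_plus_commit_b("aacaab"))  # None
```

In "xxaab" the attempts at the two "x" characters fail before (*COMMIT) is
reached, so the starting point still advances. In "aacaab" the first attempt
passes (*COMMIT) and then fails at "c", which ends the search.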
| 6349 |
|
|
| 6350 |
Note that (*COMMIT) at the start of a pattern is not the same as an |
Note that (*COMMIT) at the start of a pattern is not the same as an |
| 6351 |
anchor, unless PCRE's start-of-match optimizations are turned off, as |
anchor, unless PCRE's start-of-match optimizations are turned off, as |
| 6352 |
shown in this pcretest example: |
shown in this pcretest example: |
| 6353 |
|
|
| 6354 |
re> /(*COMMIT)abc/ |
re> /(*COMMIT)abc/ |
| 6357 |
xyzabc\Y |
xyzabc\Y |
| 6358 |
No match |
No match |
| 6359 |
|
|
| 6360 |
PCRE knows that any match must start with "a", so the optimization |
PCRE knows that any match must start with "a", so the optimization |
| 6361 |
skips along the subject to "a" before running the first match attempt, |
skips along the subject to "a" before running the first match attempt, |
| 6362 |
which succeeds. When the optimization is disabled by the \Y escape in |
which succeeds. When the optimization is disabled by the \Y escape in |
| 6363 |
the second subject, the match starts at "x" and so the (*COMMIT) causes |
the second subject, the match starts at "x" and so the (*COMMIT) causes |
| 6364 |
it to fail without trying any other starting points. |
it to fail without trying any other starting points. |
| 6365 |
|
|
| 6366 |
(*PRUNE) or (*PRUNE:NAME) |
(*PRUNE) or (*PRUNE:NAME) |
| 6367 |
|
|
| 6368 |
This verb causes the match to fail at the current starting position in |
This verb causes the match to fail at the current starting position in |
| 6369 |
the subject if the rest of the pattern does not match. If the pattern |
the subject if the rest of the pattern does not match. If the pattern |
| 6370 |
is unanchored, the normal "bumpalong" advance to the next starting |
is unanchored, the normal "bumpalong" advance to the next starting |
| 6371 |
character then happens. Backtracking can occur as usual to the left of |
character then happens. Backtracking can occur as usual to the left of |
| 6372 |
(*PRUNE), before it is reached, or when matching to the right of |
(*PRUNE), before it is reached, or when matching to the right of |
| 6373 |
(*PRUNE), but if there is no match to the right, backtracking cannot |
(*PRUNE), but if there is no match to the right, backtracking cannot |
| 6374 |
cross (*PRUNE). In simple cases, the use of (*PRUNE) is just an alter- |
cross (*PRUNE). In simple cases, the use of (*PRUNE) is just an alter- |
| 6375 |
native to an atomic group or possessive quantifier, but there are some |
native to an atomic group or possessive quantifier, but there are some |
| 6376 |
uses of (*PRUNE) that cannot be expressed in any other way. The behav- |
uses of (*PRUNE) that cannot be expressed in any other way. The behav- |
| 6377 |
iour of (*PRUNE:NAME) is the same as (*MARK:NAME)(*PRUNE). In an |
iour of (*PRUNE:NAME) is the same as (*MARK:NAME)(*PRUNE). In an |
| 6378 |
anchored pattern (*PRUNE) has the same effect as (*COMMIT). |
anchored pattern (*PRUNE) has the same effect as (*COMMIT). |
| 6379 |
|
|
| 6380 |
(*SKIP) |
(*SKIP) |
| 6381 |
|
|
| 6382 |
This verb, when given without a name, is like (*PRUNE), except that if |
This verb, when given without a name, is like (*PRUNE), except that if |
| 6383 |
the pattern is unanchored, the "bumpalong" advance is not to the next |
the pattern is unanchored, the "bumpalong" advance is not to the next |
| 6384 |
character, but to the position in the subject where (*SKIP) was encoun- |
character, but to the position in the subject where (*SKIP) was encoun- |
| 6385 |
tered. (*SKIP) signifies that whatever text was matched leading up to |
tered. (*SKIP) signifies that whatever text was matched leading up to |
| 6386 |
it cannot be part of a successful match. Consider: |
it cannot be part of a successful match. Consider: |
| 6387 |
|
|
| 6388 |
a+(*SKIP)b |
a+(*SKIP)b |
| 6389 |
|
|
| 6390 |
If the subject is "aaaac...", after the first match attempt fails |
If the subject is "aaaac...", after the first match attempt fails |
| 6391 |
(starting at the first character in the string), the starting point |
(starting at the first character in the string), the starting point |
| 6392 |
skips on to start the next attempt at "c". Note that a possessive quan- |
skips on to start the next attempt at "c". Note that a possessive quan- |
| 6393 |
tifier does not have the same effect as this example; although it would |
tifier does not have the same effect as this example; although it would |
| 6394 |
suppress backtracking during the first match attempt, the second |
suppress backtracking during the first match attempt, the second |
| 6395 |
attempt would start at the second character instead of skipping on to |
attempt would start at the second character instead of skipping on to |
| 6396 |
"c". |
"c". |
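
The difference between a+(*SKIP)b and the possessive form a++b lies in which
starting points are tried after a failure. The following Python sketch records
those starting offsets. It is a hand-rolled simulation rather than PCRE
itself; the function name starts_tried and its use_skip parameter are
hypothetical, and start-of-match optimizations are assumed to be off.

```python
def starts_tried(subject, use_skip):
    """Simulate a+(*SKIP)b (use_skip=True) or a++b (use_skip=False).

    Returns (span, tried), where span is the (start, end) of the match
    or None, and tried lists every starting offset that was attempted.
    """
    tried = []
    start = 0
    while start <= len(subject):
        tried.append(start)
        # Greedy a+ with no backtracking into it (true for both forms).
        end = start
        while end < len(subject) and subject[end] == "a":
            end += 1
        if end > start and end < len(subject) and subject[end] == "b":
            return (start, end + 1), tried
        if use_skip and end > start:
            # (*SKIP) was passed at offset `end`: bump along to there.
            start = end
        else:
            # Ordinary bumpalong to the next character.
            start += 1
    return None, tried


print(starts_tried("aaaac", use_skip=True))   # (None, [0, 4, 5])
print(starts_tried("aaaac", use_skip=False))  # (None, [0, 1, 2, 3, 4, 5])
```

With (*SKIP) the second attempt starts at "c" (offset 4). With the possessive
quantifier every offset is tried in turn.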
| 6397 |
|
|
| 6398 |
(*SKIP:NAME) |
(*SKIP:NAME) |
| 6399 |
|
|
| 6400 |
When (*SKIP) has an associated name, its behaviour is modified. If the |
When (*SKIP) has an associated name, its behaviour is modified. If the |
| 6401 |
following pattern fails to match, the previous path through the pattern |
following pattern fails to match, the previous path through the pattern |
| 6402 |
is searched for the most recent (*MARK) that has the same name. If one |
is searched for the most recent (*MARK) that has the same name. If one |
| 6403 |
is found, the "bumpalong" advance is to the subject position that cor- |
is found, the "bumpalong" advance is to the subject position that cor- |
| 6404 |
responds to that (*MARK) instead of to where (*SKIP) was encountered. |
responds to that (*MARK) instead of to where (*SKIP) was encountered. |
| 6405 |
If no (*MARK) with a matching name is found, the (*SKIP) is ignored. |
If no (*MARK) with a matching name is found, the (*SKIP) is ignored. |
| 6406 |
|
|
| 6407 |
(*THEN) or (*THEN:NAME) |
(*THEN) or (*THEN:NAME) |
| 6408 |
|
|
| 6409 |
This verb causes a skip to the next innermost alternative if the rest |
This verb causes a skip to the next innermost alternative if the rest |
| 6410 |
of the pattern does not match. That is, it cancels pending backtrack- |
of the pattern does not match. That is, it cancels pending backtrack- |
| 6411 |
ing, but only within the current alternative. Its name comes from the |
ing, but only within the current alternative. Its name comes from the |
| 6412 |
observation that it can be used for a pattern-based if-then-else block: |
observation that it can be used for a pattern-based if-then-else block: |
| 6413 |
|
|
| 6414 |
( COND1 (*THEN) FOO | COND2 (*THEN) BAR | COND3 (*THEN) BAZ ) ... |
( COND1 (*THEN) FOO | COND2 (*THEN) BAR | COND3 (*THEN) BAZ ) ... |
| 6415 |
|
|
| 6416 |
If the COND1 pattern matches, FOO is tried (and possibly further items |
If the COND1 pattern matches, FOO is tried (and possibly further items |
| 6417 |
after the end of the group if FOO succeeds); on failure, the matcher |
after the end of the group if FOO succeeds); on failure, the matcher |
| 6418 |
skips to the second alternative and tries COND2, without backtracking |
skips to the second alternative and tries COND2, without backtracking |
| 6419 |
into COND1. The behaviour of (*THEN:NAME) is exactly the same as |
into COND1. The behaviour of (*THEN:NAME) is exactly the same as |
| 6420 |
(*MARK:NAME)(*THEN). If (*THEN) is not inside an alternation, it acts |
(*MARK:NAME)(*THEN). If (*THEN) is not inside an alternation, it acts |
| 6421 |
like (*PRUNE). |
like (*PRUNE). |
| 6422 |
|
|
| 6423 |
Note that a subpattern that does not contain a | character is just a |
Note that a subpattern that does not contain a | character is just a |
| 6424 |
part of the enclosing alternative; it is not a nested alternation with |
part of the enclosing alternative; it is not a nested alternation with |
| 6425 |
only one alternative. The effect of (*THEN) extends beyond such a sub- |
only one alternative. The effect of (*THEN) extends beyond such a sub- |
| 6426 |
pattern to the enclosing alternative. Consider this pattern, where A, |
pattern to the enclosing alternative. Consider this pattern, where A, |
| 6427 |
B, etc. are complex pattern fragments that do not contain any | charac- |
B, etc. are complex pattern fragments that do not contain any | charac- |
| 6428 |
ters at this level: |
ters at this level: |
| 6429 |
|
|
| 6430 |
A (B(*THEN)C) | D |
A (B(*THEN)C) | D |
| 6431 |
|
|
| 6432 |
If A and B are matched, but there is a failure in C, matching does not |
If A and B are matched, but there is a failure in C, matching does not |
| 6433 |
backtrack into A; instead it moves to the next alternative, that is, D. |
backtrack into A; instead it moves to the next alternative, that is, D. |
| 6434 |
However, if the subpattern containing (*THEN) is given an alternative, |
However, if the subpattern containing (*THEN) is given an alternative, |
| 6435 |
it behaves differently: |
it behaves differently: |
| 6436 |
|
|
| 6437 |
A (B(*THEN)C | (*FAIL)) | D |
A (B(*THEN)C | (*FAIL)) | D |
| 6438 |
|
|
| 6439 |
The effect of (*THEN) is now confined to the inner subpattern. After a |
The effect of (*THEN) is now confined to the inner subpattern. After a |
| 6440 |
failure in C, matching moves to (*FAIL), which causes the whole subpat- |
failure in C, matching moves to (*FAIL), which causes the whole subpat- |
| 6441 |
tern to fail because there are no more alternatives to try. In this |
tern to fail because there are no more alternatives to try. In this |
| 6442 |
case, matching does now backtrack into A. |
case, matching does now backtrack into A. |
| 6443 |
|
|
| 6444 |
Note also that a conditional subpattern is not considered as having two |
Note also that a conditional subpattern is not considered as having two |
| 6445 |
alternatives, because only one is ever used. In other words, the | |
alternatives, because only one is ever used. In other words, the | |
| 6446 |
character in a conditional subpattern has a different meaning. Ignoring |
character in a conditional subpattern has a different meaning. Ignoring |
| 6447 |
white space, consider: |
white space, consider: |
| 6448 |
|
|
| 6449 |
^.*? (?(?=a) a | b(*THEN)c ) |
^.*? (?(?=a) a | b(*THEN)c ) |
| 6450 |
|
|
| 6451 |
If the subject is "ba", this pattern does not match. Because .*? is |
If the subject is "ba", this pattern does not match. Because .*? is |
| 6452 |
ungreedy, it initially matches zero characters. The condition (?=a) |
ungreedy, it initially matches zero characters. The condition (?=a) |
| 6453 |
then fails, the character "b" is matched, but "c" is not. At this |
then fails, the character "b" is matched, but "c" is not. At this |
| 6454 |
point, matching does not backtrack to .*? as might perhaps be expected |
point, matching does not backtrack to .*? as might perhaps be expected |
| 6455 |
from the presence of the | character. The conditional subpattern is |
from the presence of the | character. The conditional subpattern is |
| 6456 |
part of the single alternative that comprises the whole pattern, and so |
part of the single alternative that comprises the whole pattern, and so |
| 6457 |
the match fails. (If there was a backtrack into .*?, allowing it to |
the match fails. (If there was a backtrack into .*?, allowing it to |
| 6458 |
match "b", the match would succeed.) |
match "b", the match would succeed.) |
| 6459 |
|
|
| 6460 |
The verbs just described provide four different "strengths" of control |
The verbs just described provide four different "strengths" of control |
| 6461 |
when subsequent matching fails. (*THEN) is the weakest, carrying on the |
when subsequent matching fails. (*THEN) is the weakest, carrying on the |
| 6462 |
match at the next alternative. (*PRUNE) comes next, failing the match |
match at the next alternative. (*PRUNE) comes next, failing the match |
| 6463 |
at the current starting position, but allowing an advance to the next |
at the current starting position, but allowing an advance to the next |
| 6464 |
character (for an unanchored pattern). (*SKIP) is similar, except that |
character (for an unanchored pattern). (*SKIP) is similar, except that |
| 6465 |
the advance may be more than one character. (*COMMIT) is the strongest, |
the advance may be more than one character. (*COMMIT) is the strongest, |
| 6466 |
causing the entire match to fail. |
causing the entire match to fail. |
| 6467 |
|
|
| 6471 |
|
|
| 6472 |
(A(*COMMIT)B(*THEN)C|D) |
(A(*COMMIT)B(*THEN)C|D) |
| 6473 |
|
|
| 6474 |
Once A has matched, PCRE is committed to this match, at the current |
Once A has matched, PCRE is committed to this match, at the current |
| 6475 |
starting position. If subsequently B matches, but C does not, the nor- |
starting position. If subsequently B matches, but C does not, the nor- |
| 6476 |
mal (*THEN) action of trying the next alternative (that is, D) does not |
mal (*THEN) action of trying the next alternative (that is, D) does not |
| 6477 |
happen because (*COMMIT) overrides. |
happen because (*COMMIT) overrides. |
| 6478 |
|
|
| 6479 |
|
|
| 6480 |
SEE ALSO |
SEE ALSO |
| 6481 |
|
|
| 6482 |
pcreapi(3), pcrecallout(3), pcrematching(3), pcresyntax(3), pcre(3), |
pcreapi(3), pcrecallout(3), pcrematching(3), pcresyntax(3), pcre(3), |
| 6483 |
pcre16(3). |
pcre16(3). |
| 6484 |
|
|
| 6485 |
|
|
| 6492 |
|
|
| 6493 |
REVISION |
REVISION |
| 6494 |
|
|
| 6495 |
Last updated: 14 April 2012 |
Last updated: 01 June 2012 |
| 6496 |
Copyright (c) 1997-2012 University of Cambridge. |
Copyright (c) 1997-2012 University of Cambridge. |
| 6497 |
------------------------------------------------------------------------------ |
------------------------------------------------------------------------------ |
| 6498 |
|
|
| 6499 |
|
|
| 6500 |
PCRESYNTAX(3) PCRESYNTAX(3) |
PCRESYNTAX(3) PCRESYNTAX(3) |
| 6501 |
|
|
| 6502 |
|
|
| 6522 |
\a alarm, that is, the BEL character (hex 07) |
\a alarm, that is, the BEL character (hex 07) |
| 6523 |
\cx "control-x", where x is any ASCII character |
\cx "control-x", where x is any ASCII character |
| 6524 |
\e escape (hex 1B) |
\e escape (hex 1B) |
| 6525 |
\f formfeed (hex 0C) |
\f form feed (hex 0C) |
| 6526 |
\n newline (hex 0A) |
\n newline (hex 0A) |
| 6527 |
\r carriage return (hex 0D) |
\r carriage return (hex 0D) |
| 6528 |
\t tab (hex 09) |
\t tab (hex 09) |
| 6538 |
\C one data unit, even in UTF mode (best avoided) |
\C one data unit, even in UTF mode (best avoided) |
| 6539 |
\d a decimal digit |
\d a decimal digit |
| 6540 |
\D a character that is not a decimal digit |
\D a character that is not a decimal digit |
| 6541 |
\h a horizontal whitespace character |
\h a horizontal white space character |
| 6542 |
\H a character that is not a horizontal whitespace character |
\H a character that is not a horizontal white space character |
| 6543 |
\N a character that is not a newline |
\N a character that is not a newline |
| 6544 |
\p{xx} a character with the xx property |
\p{xx} a character with the xx property |
| 6545 |
\P{xx} a character without the xx property |
\P{xx} a character without the xx property |
| 6546 |
\R a newline sequence |
\R a newline sequence |
| 6547 |
\s a whitespace character |
\s a white space character |
| 6548 |
\S a character that is not a whitespace character |
\S a character that is not a white space character |
| 6549 |
\v a vertical whitespace character |
\v a vertical white space character |
| 6550 |
\V a character that is not a vertical whitespace character |
\V a character that is not a vertical white space character |
| 6551 |
\w a "word" character |
\w a "word" character |
| 6552 |
\W a "non-word" character |
\W a "non-word" character |
| 6553 |
\X an extended Unicode sequence |
\X an extended Unicode sequence |
| 6651 |
lower lower case letter |
lower lower case letter |
| 6652 |
print printing, including space |
print printing, including space |
| 6653 |
punct printing, excluding alphanumeric |
punct printing, excluding alphanumeric |
| 6654 |
space whitespace |
space white space |
| 6655 |
upper upper case letter |
upper upper case letter |
| 6656 |
word same as \w |
word same as \w |
| 6657 |
xdigit hexadecimal digit |
xdigit hexadecimal digit |
| 6873 |
Last updated: 10 January 2012 |
Last updated: 10 January 2012 |
| 6874 |
Copyright (c) 1997-2012 University of Cambridge. |
Copyright (c) 1997-2012 University of Cambridge. |
| 6875 |
------------------------------------------------------------------------------ |
------------------------------------------------------------------------------ |
| 6876 |
|
|
| 6877 |
|
|
| 6878 |
PCREUNICODE(3) PCREUNICODE(3) |
PCREUNICODE(3) PCREUNICODE(3) |
| 6879 |
|
|
| 6880 |
|
|
| 6952 |
|
|
| 6953 |
If an invalid UTF-8 string is passed to PCRE, an error return is given. |
If an invalid UTF-8 string is passed to PCRE, an error return is given. |
| 6954 |
At compile time, the only additional information is the offset to the |
At compile time, the only additional information is the offset to the |
| 6955 |
first byte of the failing character. The runtime functions pcre_exec() |
first byte of the failing character. The run-time functions pcre_exec() |
| 6956 |
and pcre_dfa_exec() also pass back this information, as well as a more |
and pcre_dfa_exec() also pass back this information, as well as a more |
| 6957 |
detailed reason code if the caller has provided memory in which to do |
detailed reason code if the caller has provided memory in which to do |
| 6958 |
this. |
this. |
| 6993 |
|
|
| 6994 |
If an invalid UTF-16 string is passed to PCRE, an error return is |
If an invalid UTF-16 string is passed to PCRE, an error return is |
| 6995 |
given. At compile time, the only additional information is the offset |
given. At compile time, the only additional information is the offset |
| 6996 |
to the first data unit of the failing character. The runtime functions |
to the first data unit of the failing character. The run-time functions |
| 6997 |
pcre16_exec() and pcre16_dfa_exec() also pass back this information, as |
pcre16_exec() and pcre16_dfa_exec() also pass back this information, as |
| 6998 |
well as a more detailed reason code if the caller has provided memory |
well as a more detailed reason code if the caller has provided memory |
| 6999 |
in which to do this. |
in which to do this. |
| 7047 |
7. Similarly, characters that match the POSIX named character classes |
7. Similarly, characters that match the POSIX named character classes |
| 7048 |
are all low-valued characters, unless the PCRE_UCP option is set. |
are all low-valued characters, unless the PCRE_UCP option is set. |
| 7049 |
|
|
| 7050 |
8. However, the horizontal and vertical whitespace matching escapes |
8. However, the horizontal and vertical white space matching escapes |
| 7051 |
(\h, \H, \v, and \V) do match all the appropriate Unicode characters, |
(\h, \H, \v, and \V) do match all the appropriate Unicode characters, |
| 7052 |
whether or not PCRE_UCP is set. |
whether or not PCRE_UCP is set. |
| 7053 |
|
|
| 7074 |
Last updated: 14 April 2012 |
Last updated: 14 April 2012 |
| 7075 |
Copyright (c) 1997-2012 University of Cambridge. |
Copyright (c) 1997-2012 University of Cambridge. |
| 7076 |
------------------------------------------------------------------------------ |
------------------------------------------------------------------------------ |
| 7077 |
|
|
| 7078 |
|
|
| 7079 |
PCREJIT(3) PCREJIT(3) |
PCREJIT(3) PCREJIT(3) |
| 7080 |
|
|
| 7081 |
|
|
| 7226 |
|
|
| 7227 |
\C match a single byte; not supported in UTF-8 mode |
\C match a single byte; not supported in UTF-8 mode |
| 7228 |
(?Cn) callouts |
(?Cn) callouts |
| 7229 |
(*COMMIT) ) |
(*PRUNE) ) |
| 7230 |
(*MARK) ) |
(*SKIP) ) backtracking control verbs |
|
(*PRUNE) ) the backtracking control verbs |
|
|
(*SKIP) ) |
|
| 7231 |
(*THEN) ) |
(*THEN) ) |
| 7232 |
|
|
| 7233 |
Support for some of these may be added in future. |
Support for some of these may be added in future. |
| 7456 |
|
|
| 7457 |
REVISION |
REVISION |
| 7458 |
|
|
| 7459 |
Last updated: 14 April 2012 |
Last updated: 04 May 2012 |
| 7460 |
Copyright (c) 1997-2012 University of Cambridge. |
Copyright (c) 1997-2012 University of Cambridge. |
| 7461 |
------------------------------------------------------------------------------ |
------------------------------------------------------------------------------ |
| 7462 |
|
|
| 7463 |
|
|
| 7464 |
PCREPARTIAL(3) PCREPARTIAL(3) |
PCREPARTIAL(3) PCREPARTIAL(3) |
| 7465 |
|
|
| 7466 |
|
|
| 7909 |
Last updated: 24 February 2012 |
Last updated: 24 February 2012 |
| 7910 |
Copyright (c) 1997-2012 University of Cambridge. |
Copyright (c) 1997-2012 University of Cambridge. |
| 7911 |
------------------------------------------------------------------------------ |
------------------------------------------------------------------------------ |
| 7912 |
|
|
| 7913 |
|
|
| 7914 |
PCREPRECOMPILE(3) PCREPRECOMPILE(3) |
PCREPRECOMPILE(3) PCREPRECOMPILE(3) |
| 7915 |
|
|
| 7916 |
|
|
| 8044 |
Last updated: 10 January 2012 |
Last updated: 10 January 2012 |
| 8045 |
Copyright (c) 1997-2012 University of Cambridge. |
Copyright (c) 1997-2012 University of Cambridge. |
| 8046 |
------------------------------------------------------------------------------ |
------------------------------------------------------------------------------ |
| 8047 |
|
|
| 8048 |
|
|
| 8049 |
PCREPERFORM(3) PCREPERFORM(3) |
PCREPERFORM(3) PCREPERFORM(3) |
| 8050 |
|
|
| 8051 |
|
|
| 8214 |
Last updated: 09 January 2012 |
Last updated: 09 January 2012 |
| 8215 |
Copyright (c) 1997-2012 University of Cambridge. |
Copyright (c) 1997-2012 University of Cambridge. |
| 8216 |
------------------------------------------------------------------------------ |
------------------------------------------------------------------------------ |
| 8217 |
|
|
| 8218 |
|
|
| 8219 |
PCREPOSIX(3) PCREPOSIX(3) |
PCREPOSIX(3) PCREPOSIX(3) |
| 8220 |
|
|
| 8221 |
|
|
| 8478 |
Last updated: 09 January 2012 |
Last updated: 09 January 2012 |
| 8479 |
Copyright (c) 1997-2012 University of Cambridge. |
Copyright (c) 1997-2012 University of Cambridge. |
| 8480 |
------------------------------------------------------------------------------ |
------------------------------------------------------------------------------ |
| 8481 |
|
|
| 8482 |
|
|
| 8483 |
PCRECPP(3) PCRECPP(3) |
PCRECPP(3) PCRECPP(3) |
| 8484 |
|
|
| 8485 |
|
|
| 8656 |
PCRE_DOTALL dot matches newlines /s |
PCRE_DOTALL dot matches newlines /s |
| 8657 |
PCRE_DOLLAR_ENDONLY $ matches only at end N/A |
PCRE_DOLLAR_ENDONLY $ matches only at end N/A |
| 8658 |
PCRE_EXTRA strict escape parsing N/A |
PCRE_EXTRA strict escape parsing N/A |
| 8659 |
PCRE_EXTENDED ignore whitespaces /x |
PCRE_EXTENDED ignore white spaces /x |
| 8660 |
PCRE_UTF8 handles UTF8 chars built-in |
PCRE_UTF8 handles UTF8 chars built-in |
| 8661 |
PCRE_UNGREEDY reverses * and *? N/A |
PCRE_UNGREEDY reverses * and *? N/A |
| 8662 |
PCRE_NO_AUTO_CAPTURE disables capturing parens N/A (*) |
PCRE_NO_AUTO_CAPTURE disables capturing parens N/A (*) |
| 8820 |
|
|
| 8821 |
Last updated: 08 January 2012 |
Last updated: 08 January 2012 |
| 8822 |
------------------------------------------------------------------------------ |
------------------------------------------------------------------------------ |
| 8823 |
|
|
| 8824 |
|
|
| 8825 |
PCRESAMPLE(3) PCRESAMPLE(3) |
PCRESAMPLE(3) PCRESAMPLE(3) |
| 8826 |
|
|
| 8827 |
|
|
| 8944 |
The maximum length of name for a named subpattern is 32 characters, and |
The maximum length of name for a named subpattern is 32 characters, and |
| 8945 |
the maximum number of named subpatterns is 10000. |
the maximum number of named subpatterns is 10000. |
| 8946 |
|
|
| 8947 |
|
The maximum length of a name in a (*MARK), (*PRUNE), (*SKIP), or |
| 8948 |
|
(*THEN) verb is 255 for the 8-bit library and 65535 for the 16-bit |
| 8949 |
|
library. |
| 8950 |
|
|
| 8951 |
The maximum length of a subject string is the largest positive number |
The maximum length of a subject string is the largest positive number |
| 8952 |
that an integer variable can hold. However, when using the traditional |
that an integer variable can hold. However, when using the traditional |
| 8953 |
matching function, PCRE uses recursion to handle subpatterns and indef- |
matching function, PCRE uses recursion to handle subpatterns and indef- |
| 8965 |
|
|
| 8966 |
REVISION |
REVISION |
| 8967 |
|
|
| 8968 |
Last updated: 08 January 2012 |
Last updated: 04 May 2012 |
| 8969 |
Copyright (c) 1997-2012 University of Cambridge. |
Copyright (c) 1997-2012 University of Cambridge. |
| 8970 |
------------------------------------------------------------------------------ |
------------------------------------------------------------------------------ |
| 8971 |
|
|
| 8972 |
|
|
| 8973 |
PCRESTACK(3) PCRESTACK(3) |
PCRESTACK(3) PCRESTACK(3) |
| 8974 |
|
|
| 8975 |
|
|
| 9153 |
Last updated: 21 January 2012 |
Last updated: 21 January 2012 |
| 9154 |
Copyright (c) 1997-2012 University of Cambridge. |
Copyright (c) 1997-2012 University of Cambridge. |
| 9155 |
------------------------------------------------------------------------------ |
------------------------------------------------------------------------------ |
| 9156 |
|
|
| 9157 |
|
|