| 45 |
|
|
| 46 |
Details of exactly which Perl regular expression features are and are |
Details of exactly which Perl regular expression features are and are |
| 47 |
not supported by PCRE are given in separate documents. See the pcrepat- |
not supported by PCRE are given in separate documents. See the pcrepat- |
| 48 |
tern and pcrecompat pages. |
tern and pcrecompat pages. There is a syntax summary in the pcresyntax |
| 49 |
|
page. |
| 50 |
|
|
| 51 |
Some features of PCRE can be included, excluded, or changed when the |
Some features of PCRE can be included, excluded, or changed when the |
| 52 |
library is built. The pcre_config() function makes it possible for a |
library is built. The pcre_config() function makes it possible for a |
| 53 |
client to discover which features are available. The features them- |
client to discover which features are available. The features them- |
| 54 |
selves are described in the pcrebuild page. Documentation about build- |
selves are described in the pcrebuild page. Documentation about build- |
| 55 |
ing PCRE for various operating systems can be found in the README file |
ing PCRE for various operating systems can be found in the README file |
| 56 |
in the source distribution. |
in the source distribution. |
| 57 |
|
|
| 58 |
The library contains a number of undocumented internal functions and |
The library contains a number of undocumented internal functions and |
| 59 |
data tables that are used by more than one of the exported external |
data tables that are used by more than one of the exported external |
| 60 |
functions, but which are not intended for use by external callers. |
functions, but which are not intended for use by external callers. |
| 61 |
Their names all begin with "_pcre_", which hopefully will not provoke |
Their names all begin with "_pcre_", which hopefully will not provoke |
| 62 |
any name clashes. In some environments, it is possible to control which |
any name clashes. In some environments, it is possible to control which |
| 63 |
external symbols are exported when a shared library is built, and in |
external symbols are exported when a shared library is built, and in |
| 64 |
these cases the undocumented symbols are not exported. |
these cases the undocumented symbols are not exported. |
| 65 |
|
|
| 66 |
|
|
| 67 |
USER DOCUMENTATION |
USER DOCUMENTATION |
| 68 |
|
|
| 69 |
The user documentation for PCRE comprises a number of different sec- |
The user documentation for PCRE comprises a number of different sec- |
| 70 |
tions. In the "man" format, each of these is a separate "man page". In |
tions. In the "man" format, each of these is a separate "man page". In |
| 71 |
the HTML format, each is a separate page, linked from the index page. |
the HTML format, each is a separate page, linked from the index page. |
| 72 |
In the plain text format, all the sections are concatenated, for ease |
In the plain text format, all the sections are concatenated, for ease |
| 73 |
of searching. The sections are as follows: |
of searching. The sections are as follows: |
| 74 |
|
|
| 75 |
pcre this document |
pcre this document |
| 84 |
pcrepartial details of the partial matching facility |
pcrepartial details of the partial matching facility |
| 85 |
pcrepattern syntax and semantics of supported |
pcrepattern syntax and semantics of supported |
| 86 |
regular expressions |
regular expressions |
| 87 |
|
pcresyntax quick syntax reference |
| 88 |
pcreperform discussion of performance issues |
pcreperform discussion of performance issues |
| 89 |
pcreposix the POSIX-compatible C API |
pcreposix the POSIX-compatible C API |
| 90 |
pcreprecompile details of saving and re-using precompiled patterns |
pcreprecompile details of saving and re-using precompiled patterns |
| 92 |
pcrestack discussion of stack usage |
pcrestack discussion of stack usage |
| 93 |
pcretest description of the pcretest testing command |
pcretest description of the pcretest testing command |
| 94 |
|
|
| 95 |
In addition, in the "man" and HTML formats, there is a short page for |
In addition, in the "man" and HTML formats, there is a short page for |
| 96 |
each C library function, listing its arguments and results. |
each C library function, listing its arguments and results. |
| 97 |
|
|
| 98 |
|
|
| 99 |
LIMITATIONS |
LIMITATIONS |
| 100 |
|
|
| 101 |
There are some size limitations in PCRE but it is hoped that they will |
There are some size limitations in PCRE but it is hoped that they will |
| 102 |
never in practice be relevant. |
never in practice be relevant. |
| 103 |
|
|
| 104 |
The maximum length of a compiled pattern is 65539 (sic) bytes if PCRE |
The maximum length of a compiled pattern is 65539 (sic) bytes if PCRE |
| 105 |
is compiled with the default internal linkage size of 2. If you want to |
is compiled with the default internal linkage size of 2. If you want to |
| 106 |
process regular expressions that are truly enormous, you can compile |
process regular expressions that are truly enormous, you can compile |
| 107 |
PCRE with an internal linkage size of 3 or 4 (see the README file in |
PCRE with an internal linkage size of 3 or 4 (see the README file in |
| 108 |
the source distribution and the pcrebuild documentation for details). |
the source distribution and the pcrebuild documentation for details). |
| 109 |
In these cases the limit is substantially larger. However, the speed |
In these cases the limit is substantially larger. However, the speed |
| 110 |
of execution is slower. |
of execution is slower. |
| 111 |
|
|
| 112 |
All values in repeating quantifiers must be less than 65536. The maxi- |
All values in repeating quantifiers must be less than 65536. |
|
mum compiled length of subpattern with an explicit repeat count is |
|
|
30000 bytes. The maximum number of capturing subpatterns is 65535. |
|
| 113 |
|
|
| 114 |
There is no limit to the number of parenthesized subpatterns, but there |
There is no limit to the number of parenthesized subpatterns, but there |
| 115 |
can be no more than 65535 capturing subpatterns. |
can be no more than 65535 capturing subpatterns. |
| 116 |
|
|
|
If a non-capturing subpattern with an unlimited repetition quantifier |
|
|
can match an empty string, there is a limit of 1000 on the number of |
|
|
times it can be repeated while not matching an empty string - if it |
|
|
does match an empty string, the loop is immediately broken. |
|
|
|
|
| 117 |
The maximum length of a name for a named subpattern is 32 characters, and |
The maximum length of a name for a named subpattern is 32 characters, and |
| 118 |
the maximum number of named subpatterns is 10000. |
the maximum number of named subpatterns is 10000. |
| 119 |
|
|
| 226 |
|
|
| 227 |
REVISION |
REVISION |
| 228 |
|
|
| 229 |
Last updated: 30 July 2007 |
Last updated: 06 August 2007 |
| 230 |
Copyright (c) 1997-2007 University of Cambridge. |
Copyright (c) 1997-2007 University of Cambridge. |
| 231 |
------------------------------------------------------------------------------ |
------------------------------------------------------------------------------ |
| 232 |
|
|
| 2207 |
subpatterns are not required to be unique. Normally, patterns with |
subpatterns are not required to be unique. Normally, patterns with |
| 2208 |
duplicate names are such that in any one match, only one of the named |
duplicate names are such that in any one match, only one of the named |
| 2209 |
subpatterns participates. An example is shown in the pcrepattern docu- |
subpatterns participates. An example is shown in the pcrepattern docu- |
| 2210 |
mentation. When duplicates are present, pcre_copy_named_substring() and |
mentation. |
| 2211 |
|
|
| 2212 |
|
When duplicates are present, pcre_copy_named_substring() and |
| 2213 |
pcre_get_named_substring() return the first substring corresponding to |
pcre_get_named_substring() return the first substring corresponding to |
| 2214 |
the given name that is set. If none are set, an empty string is |
the given name that is set. If none are set, PCRE_ERROR_NOSUBSTRING |
| 2215 |
returned. The pcre_get_stringnumber() function returns one of the num- |
(-7) is returned; no data is returned. The pcre_get_stringnumber() |
| 2216 |
bers that are associated with the name, but it is not defined which it |
function returns one of the numbers that are associated with the name, |
| 2217 |
is. |
but it is not defined which it is. |
| 2218 |
|
|
| 2219 |
If you want to get full details of all captured substrings for a given |
If you want to get full details of all captured substrings for a given |
| 2220 |
name, you must use the pcre_get_stringtable_entries() function. The |
name, you must use the pcre_get_stringtable_entries() function. The |
| 2729 |
|
|
| 2730 |
PCRE REGULAR EXPRESSION DETAILS |
PCRE REGULAR EXPRESSION DETAILS |
| 2731 |
|
|
| 2732 |
The syntax and semantics of the regular expressions supported by PCRE |
The syntax and semantics of the regular expressions that are supported |
| 2733 |
are described below. Regular expressions are also described in the Perl |
by PCRE are described in detail below. There is a quick-reference syn- |
| 2734 |
documentation and in a number of books, some of which have copious |
tax summary in the pcresyntax page. Perl's regular expressions are |
| 2735 |
examples. Jeffrey Friedl's "Mastering Regular Expressions", published |
described in its own documentation, and regular expressions in general |
| 2736 |
by O'Reilly, covers regular expressions in great detail. This descrip- |
are covered in a number of books, some of which have copious examples. |
| 2737 |
tion of PCRE's regular expressions is intended as reference material. |
Jeffrey Friedl's "Mastering Regular Expressions", published by |
| 2738 |
|
O'Reilly, covers regular expressions in great detail. This description |
| 2739 |
|
of PCRE's regular expressions is intended as reference material. |
| 2740 |
|
|
| 2741 |
The original operation of PCRE was on strings of one-byte characters. |
The original operation of PCRE was on strings of one-byte characters. |
| 2742 |
However, there is now also support for UTF-8 character strings. To use |
However, there is now also support for UTF-8 character strings. To use |
| 2938 |
|
|
| 2939 |
Absolute and relative back references |
Absolute and relative back references |
| 2940 |
|
|
| 2941 |
The sequence \g followed by a positive or negative number, optionally |
The sequence \g followed by an unsigned or a negative number, option- |
| 2942 |
enclosed in braces, is an absolute or relative back reference. A named |
ally enclosed in braces, is an absolute or relative back reference. A |
| 2943 |
back reference can be coded as \g{name}. Back references are discussed |
named back reference can be coded as \g{name}. Back references are dis- |
| 2944 |
later, following the discussion of parenthesized subpatterns. |
cussed later, following the discussion of parenthesized subpatterns. |
| 2945 |
|
|
| 2946 |
Generic character types |
Generic character types |
| 2947 |
|
|
| 3877 |
|
|
| 3878 |
\d++foo |
\d++foo |
| 3879 |
|
|
| 3880 |
Possessive quantifiers are always greedy; the setting of the |
Note that a possessive quantifier can be used with an entire group, for |
| 3881 |
|
example: |
| 3882 |
|
|
| 3883 |
|
(abc|xyz){2,3}+ |
| 3884 |
|
|
| 3885 |
|
Possessive quantifiers are always greedy; the setting of the |
| 3886 |
PCRE_UNGREEDY option is ignored. They are a convenient notation for the |
PCRE_UNGREEDY option is ignored. They are a convenient notation for the |
| 3887 |
simpler forms of atomic group. However, there is no difference in the |
simpler forms of atomic group. However, there is no difference in the |
| 3888 |
meaning of a possessive quantifier and the equivalent atomic group, |
meaning of a possessive quantifier and the equivalent atomic group, |
| 3889 |
though there may be a performance difference; possessive quantifiers |
though there may be a performance difference; possessive quantifiers |
| 3890 |
should be slightly faster. |
should be slightly faster. |
| 3891 |
|
|
| 3892 |
The possessive quantifier syntax is an extension to the Perl 5.8 syn- |
The possessive quantifier syntax is an extension to the Perl 5.8 syn- |
| 3893 |
tax. Jeffrey Friedl originated the idea (and the name) in the first |
tax. Jeffrey Friedl originated the idea (and the name) in the first |
| 3894 |
edition of his book. Mike McCloskey liked it, so implemented it when he |
edition of his book. Mike McCloskey liked it, so implemented it when he |
| 3895 |
built Sun's Java package, and PCRE copied it from there. It ultimately |
built Sun's Java package, and PCRE copied it from there. It ultimately |
| 3896 |
found its way into Perl at release 5.10. |
found its way into Perl at release 5.10. |
| 3897 |
|
|
| 3898 |
PCRE has an optimization that automatically "possessifies" certain sim- |
PCRE has an optimization that automatically "possessifies" certain sim- |
| 3899 |
ple pattern constructs. For example, the sequence A+B is treated as |
ple pattern constructs. For example, the sequence A+B is treated as |
| 3900 |
A++B because there is no point in backtracking into a sequence of A's |
A++B because there is no point in backtracking into a sequence of A's |
| 3901 |
when B must follow. |
when B must follow. |
| 3902 |
|
|
| 3903 |
When a pattern contains an unlimited repeat inside a subpattern that |
When a pattern contains an unlimited repeat inside a subpattern that |
| 3904 |
can itself be repeated an unlimited number of times, the use of an |
can itself be repeated an unlimited number of times, the use of an |
| 3905 |
atomic group is the only way to avoid some failing matches taking a |
atomic group is the only way to avoid some failing matches taking a |
| 3906 |
very long time indeed. The pattern |
very long time indeed. The pattern |
| 3907 |
|
|
| 3908 |
(\D+|<\d+>)*[!?] |
(\D+|<\d+>)*[!?] |
| 3909 |
|
|
| 3910 |
matches an unlimited number of substrings that either consist of non- |
matches an unlimited number of substrings that either consist of non- |
| 3911 |
digits, or digits enclosed in <>, followed by either ! or ?. When it |
digits, or digits enclosed in <>, followed by either ! or ?. When it |
| 3912 |
matches, it runs quickly. However, if it is applied to |
matches, it runs quickly. However, if it is applied to |
| 3913 |
|
|
| 3914 |
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa |
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa |
| 3915 |
|
|
| 3916 |
it takes a long time before reporting failure. This is because the |
it takes a long time before reporting failure. This is because the |
| 3917 |
string can be divided between the internal \D+ repeat and the external |
string can be divided between the internal \D+ repeat and the external |
| 3918 |
* repeat in a large number of ways, and all have to be tried. (The |
* repeat in a large number of ways, and all have to be tried. (The |
| 3919 |
example uses [!?] rather than a single character at the end, because |
example uses [!?] rather than a single character at the end, because |
| 3920 |
both PCRE and Perl have an optimization that allows for fast failure |
both PCRE and Perl have an optimization that allows for fast failure |
| 3921 |
when a single character is used. They remember the last single charac- |
when a single character is used. They remember the last single charac- |
| 3922 |
ter that is required for a match, and fail early if it is not present |
ter that is required for a match, and fail early if it is not present |
| 3923 |
in the string.) If the pattern is changed so that it uses an atomic |
in the string.) If the pattern is changed so that it uses an atomic |
| 3924 |
group, like this: |
group, like this: |
| 3925 |
|
|
| 3926 |
((?>\D+)|<\d+>)*[!?] |
((?>\D+)|<\d+>)*[!?] |
| 3927 |
|
|
| 3928 |
sequences of non-digits cannot be broken, and failure happens quickly. |
sequences of non-digits cannot be broken, and failure happens quickly. |
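
The difference between the two patterns can be sketched outside PCRE. The
example below uses Python's re module as a stand-in, which is an assumption
of this sketch and not PCRE itself. Python 3.11 and later accept (?>...)
directly; so that the sketch also runs on earlier versions, the atomic group
is emulated here with a lookahead followed by a back reference,
(?=(\D+))\1. That emulation is a swapped-in technique, not how PCRE
implements atomic groups. The failing subject is kept short on purpose,
because the unprotected pattern slows down very quickly as the subject grows.

```python
import re

# Pattern from the text: an unlimited repeat nested inside another
# unlimited repeat, which backtracks heavily on a failing subject.
plain = re.compile(r'(\D+|<\d+>)*[!?]')

# Emulated atomic group (assumption: stand-in for (?>\D+)).
# The lookahead captures the longest run of non-digits, and \1 then
# consumes exactly that run, so the run cannot be split up afterwards.
atomic = re.compile(r'(?:(?=(\D+))\1|<\d+>)*[!?]')

def matches(pattern, subject):
    """Return True if the pattern matches at the start of the subject."""
    return pattern.match(subject) is not None

failing = 'a' * 12      # no "!" or "?", so both patterns must fail
succeeding = '<42>!'    # digits in <> followed by "!"

print('failing   :', matches(plain, failing), matches(atomic, failing))
print('succeeding:', matches(plain, succeeding), matches(atomic, succeeding))
```

Both patterns agree on these two subjects; the point of the atomic form is
only that the failing case is rejected without trying every way of dividing
the run of "a"s.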
| 3929 |
|
|
| 3930 |
|
|
| 3931 |
BACK REFERENCES |
BACK REFERENCES |
| 3932 |
|
|
| 3933 |
Outside a character class, a backslash followed by a digit greater than |
Outside a character class, a backslash followed by a digit greater than |
| 3934 |
0 (and possibly further digits) is a back reference to a capturing sub- |
0 (and possibly further digits) is a back reference to a capturing sub- |
| 3935 |
pattern earlier (that is, to its left) in the pattern, provided there |
pattern earlier (that is, to its left) in the pattern, provided there |
| 3936 |
have been that many previous capturing left parentheses. |
have been that many previous capturing left parentheses. |
| 3937 |
|
|
| 3938 |
However, if the decimal number following the backslash is less than 10, |
However, if the decimal number following the backslash is less than 10, |
| 3939 |
it is always taken as a back reference, and causes an error only if |
it is always taken as a back reference, and causes an error only if |
| 3940 |
there are not that many capturing left parentheses in the entire pat- |
there are not that many capturing left parentheses in the entire pat- |
| 3941 |
tern. In other words, the parentheses that are referenced need not be |
tern. In other words, the parentheses that are referenced need not be |
| 3942 |
to the left of the reference for numbers less than 10. A "forward back |
to the left of the reference for numbers less than 10. A "forward back |
| 3943 |
reference" of this type can make sense when a repetition is involved |
reference" of this type can make sense when a repetition is involved |
| 3944 |
and the subpattern to the right has participated in an earlier itera- |
and the subpattern to the right has participated in an earlier itera- |
| 3945 |
tion. |
tion. |
| 3946 |
|
|
| 3947 |
It is not possible to have a numerical "forward back reference" to a |
It is not possible to have a numerical "forward back reference" to a |
| 3948 |
subpattern whose number is 10 or more using this syntax because a |
subpattern whose number is 10 or more using this syntax because a |
| 3949 |
sequence such as \50 is interpreted as a character defined in octal. |
sequence such as \50 is interpreted as a character defined in octal. |
| 3950 |
See the subsection entitled "Non-printing characters" above for further |
See the subsection entitled "Non-printing characters" above for further |
| 3951 |
details of the handling of digits following a backslash. There is no |
details of the handling of digits following a backslash. There is no |
| 3952 |
such problem when named parentheses are used. A back reference to any |
such problem when named parentheses are used. A back reference to any |
| 3953 |
subpattern is possible using named parentheses (see below). |
subpattern is possible using named parentheses (see below). |
| 3954 |
|
|
| 3955 |
Another way of avoiding the ambiguity inherent in the use of digits |
Another way of avoiding the ambiguity inherent in the use of digits |
| 3956 |
following a backslash is to use the \g escape sequence, which is a fea- |
following a backslash is to use the \g escape sequence, which is a fea- |
| 3957 |
ture introduced in Perl 5.10. This escape must be followed by a posi- |
ture introduced in Perl 5.10. This escape must be followed by an |
| 3958 |
tive or a negative number, optionally enclosed in braces. These exam- |
unsigned number or a negative number, optionally enclosed in braces. |
| 3959 |
ples are all identical: |
These examples are all identical: |
| 3960 |
|
|
| 3961 |
(ring), \1 |
(ring), \1 |
| 3962 |
(ring), \g1 |
(ring), \g1 |
| 3963 |
(ring), \g{1} |
(ring), \g{1} |
| 3964 |
|
|
| 3965 |
A positive number specifies an absolute reference without the ambiguity |
An unsigned number specifies an absolute reference without the ambigu- |
| 3966 |
that is present in the older syntax. It is also useful when literal |
ity that is present in the older syntax. It is also useful when literal |
| 3967 |
digits follow the reference. A negative number is a relative reference. |
digits follow the reference. A negative number is a relative reference. |
| 3968 |
Consider this example: |
Consider this example: |
| 3969 |
|
|
| 3970 |
(abc(def)ghi)\g{-1} |
(abc(def)ghi)\g{-1} |
| 3971 |
|
|
| 3972 |
The sequence \g{-1} is a reference to the most recently started captur- |
The sequence \g{-1} is a reference to the most recently started captur- |
| 3973 |
ing subpattern before \g, that is, it is equivalent to \2. Similarly, |
ing subpattern before \g, that is, it is equivalent to \2. Similarly, |
| 3974 |
\g{-2} would be equivalent to \1. The use of relative references can be |
\g{-2} would be equivalent to \1. The use of relative references can be |
| 3975 |
helpful in long patterns, and also in patterns that are created by |
helpful in long patterns, and also in patterns that are created by |
| 3976 |
joining together fragments that contain references within themselves. |
joining together fragments that contain references within themselves. |
| 3977 |
|
|
| 3978 |
A back reference matches whatever actually matched the capturing sub- |
A back reference matches whatever actually matched the capturing sub- |
| 3979 |
pattern in the current subject string, rather than anything matching |
pattern in the current subject string, rather than anything matching |
| 3980 |
the subpattern itself (see "Subpatterns as subroutines" below for a way |
the subpattern itself (see "Subpatterns as subroutines" below for a way |
| 3981 |
of doing that). So the pattern |
of doing that). So the pattern |
| 3982 |
|
|
| 3983 |
(sens|respons)e and \1ibility |
(sens|respons)e and \1ibility |
| 3984 |
|
|
| 3985 |
matches "sense and sensibility" and "response and responsibility", but |
matches "sense and sensibility" and "response and responsibility", but |
| 3986 |
not "sense and responsibility". If caseful matching is in force at the |
not "sense and responsibility". If caseful matching is in force at the |
| 3987 |
time of the back reference, the case of letters is relevant. For exam- |
time of the back reference, the case of letters is relevant. For exam- |
| 3988 |
ple, |
ple, |
| 3989 |
|
|
| 3990 |
((?i)rah)\s+\1 |
((?i)rah)\s+\1 |
| 3991 |
|
|
| 3992 |
matches "rah rah" and "RAH RAH", but not "RAH rah", even though the |
matches "rah rah" and "RAH RAH", but not "RAH rah", even though the |
| 3993 |
original capturing subpattern is matched caselessly. |
original capturing subpattern is matched caselessly. |
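
The two examples above can be tried with Python's re module, used here
only as an illustrative stand-in for PCRE (an assumption of this sketch).
Recent Python versions reject a bare (?i) in the middle of a pattern, so the
scoped form (?i:...) is used instead; the back reference itself is still
matched casefully.

```python
import re

# \1 must repeat the text that group 1 actually captured,
# not merely something that group 1 could have matched.
echo = re.compile(r'(sens|respons)e and \1ibility')

for subject in ('sense and sensibility',
                'response and responsibility',
                'sense and responsibility'):
    print(subject, '->', echo.fullmatch(subject) is not None)

# Group 1 is matched caselessly, but \1 lies outside the scoped flag,
# so it must reproduce the captured text with the same case.
rah = re.compile(r'((?i:rah))\s+\1')

for subject in ('rah rah', 'RAH RAH', 'RAH rah'):
    print(subject, '->', rah.fullmatch(subject) is not None)
```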
| 3994 |
|
|
| 3995 |
There are several different ways of writing back references to named |
There are several different ways of writing back references to named |
| 3996 |
subpatterns. The .NET syntax \k{name} and the Perl syntax \k<name> or |
subpatterns. The .NET syntax \k{name} and the Perl syntax \k<name> or |
| 3997 |
\k'name' are supported, as is the Python syntax (?P=name). Perl 5.10's |
\k'name' are supported, as is the Python syntax (?P=name). Perl 5.10's |
| 3998 |
unified back reference syntax, in which \g can be used for both numeric |
unified back reference syntax, in which \g can be used for both numeric |
| 3999 |
and named references, is also supported. We could rewrite the above |
and named references, is also supported. We could rewrite the above |
| 4000 |
example in any of the following ways: |
example in any of the following ways: |
| 4001 |
|
|
| 4002 |
(?<p1>(?i)rah)\s+\k<p1> |
(?<p1>(?i)rah)\s+\k<p1> |
| 4004 |
(?P<p1>(?i)rah)\s+(?P=p1) |
(?P<p1>(?i)rah)\s+(?P=p1) |
| 4005 |
(?<p1>(?i)rah)\s+\g{p1} |
(?<p1>(?i)rah)\s+\g{p1} |
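
Of the forms listed, the Python syntax can be checked directly with Python's
re module, which is used here as a stand-in and is not PCRE. As an
assumption of this sketch, the inline (?i) is written in its scoped form
(?i:...), because recent Python versions do not allow a bare (?i) in
mid-pattern.

```python
import re

# Named group p1, referenced later by name with (?P=p1).
named = re.compile(r'(?P<p1>(?i:rah))\s+(?P=p1)')

match = named.fullmatch('RAH RAH')
print(match is not None)                    # the named reference matched
print(match.group('p1'))                    # text captured by group p1
print(named.fullmatch('RAH rah') is None)   # case of the reference differs
```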
| 4006 |
|
|
| 4007 |
A subpattern that is referenced by name may appear in the pattern |
A subpattern that is referenced by name may appear in the pattern |
| 4008 |
before or after the reference. |
before or after the reference. |
| 4009 |
|
|
| 4010 |
There may be more than one back reference to the same subpattern. If a |
There may be more than one back reference to the same subpattern. If a |
| 4011 |
subpattern has not actually been used in a particular match, any back |
subpattern has not actually been used in a particular match, any back |
| 4012 |
references to it always fail. For example, the pattern |
references to it always fail. For example, the pattern |
| 4013 |
|
|
| 4014 |
(a|(bc))\2 |
(a|(bc))\2 |
| 4015 |
|
|
| 4016 |
always fails if it starts to match "a" rather than "bc". Because there |
always fails if it starts to match "a" rather than "bc". Because there |
| 4017 |
may be many capturing parentheses in a pattern, all digits following |
may be many capturing parentheses in a pattern, all digits following |
| 4018 |
the backslash are taken as part of a potential back reference number. |
the backslash are taken as part of a potential back reference number. |
| 4019 |
If the pattern continues with a digit character, some delimiter must be |
If the pattern continues with a digit character, some delimiter must be |
| 4020 |
used to terminate the back reference. If the PCRE_EXTENDED option is |
used to terminate the back reference. If the PCRE_EXTENDED option is |
| 4021 |
set, this can be whitespace. Otherwise an empty comment (see "Com- |
set, this can be whitespace. Otherwise an empty comment (see "Com- |
| 4022 |
ments" below) can be used. |
ments" below) can be used. |
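
A short sketch of both points, using Python's re module as a stand-in for
PCRE (an assumption; the two engines are not identical). Here re.VERBOSE
plays the role that PCRE_EXTENDED plays in the text, and (?#) is an empty
comment.

```python
import re

# A back reference to a group that did not take part in the match fails.
unset = re.compile(r'(a|(bc))\2')
print(unset.fullmatch('bcbc') is not None)  # group 2 was set to "bc"
print(unset.search('aa') is None)           # group 2 unset, so \2 fails

# A literal digit after a back reference needs a delimiter; otherwise the
# digits run together and are read as a reference to group 10.
try:
    re.compile(r'(a)\10')
    ambiguous_rejected = False
except re.error:
    ambiguous_rejected = True
print(ambiguous_rejected)

# Delimiter 1: an empty comment between the reference and the digit.
with_comment = re.compile(r'(a)\1(?#)0')

# Delimiter 2: whitespace, which is ignored in verbose (extended) mode.
with_space = re.compile(r'(a)\1 0', re.VERBOSE)

print(with_comment.fullmatch('aa0') is not None)
print(with_space.fullmatch('aa0') is not None)
```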
| 4023 |
|
|
| 4024 |
A back reference that occurs inside the parentheses to which it refers |
A back reference that occurs inside the parentheses to which it refers |
| 4025 |
fails when the subpattern is first used, so, for example, (a\1) never |
fails when the subpattern is first used, so, for example, (a\1) never |
| 4026 |
matches. However, such references can be useful inside repeated sub- |
matches. However, such references can be useful inside repeated sub- |
| 4027 |
patterns. For example, the pattern |
patterns. For example, the pattern |
| 4028 |
|
|
| 4029 |
(a|b\1)+ |
(a|b\1)+ |
| 4030 |
|
|
| 4031 |
matches any number of "a"s and also "aba", "ababbaa" etc. At each iter- |
matches any number of "a"s and also "aba", "ababbaa" etc. At each iter- |
| 4032 |
ation of the subpattern, the back reference matches the character |
ation of the subpattern, the back reference matches the character |
| 4033 |
string corresponding to the previous iteration. In order for this to |
string corresponding to the previous iteration. In order for this to |
| 4034 |
work, the pattern must be such that the first iteration does not need |
work, the pattern must be such that the first iteration does not need |
| 4035 |
to match the back reference. This can be done using alternation, as in |
to match the back reference. This can be done using alternation, as in |
| 4036 |
the example above, or by a quantifier with a minimum of zero. |
the example above, or by a quantifier with a minimum of zero. |
| 4037 |
|
|
| 4038 |
|
|
| 4039 |
ASSERTIONS |
ASSERTIONS |
| 4040 |
|
|
| 4041 |
An assertion is a test on the characters following or preceding the |
An assertion is a test on the characters following or preceding the |
| 4042 |
current matching point that does not actually consume any characters. |
current matching point that does not actually consume any characters. |
| 4043 |
The simple assertions coded as \b, \B, \A, \G, \Z, \z, ^ and $ are |
The simple assertions coded as \b, \B, \A, \G, \Z, \z, ^ and $ are |
| 4044 |
described above. |
described above. |
| 4045 |
|
|
| 4046 |
More complicated assertions are coded as subpatterns. There are two |
More complicated assertions are coded as subpatterns. There are two |
| 4047 |
kinds: those that look ahead of the current position in the subject |
kinds: those that look ahead of the current position in the subject |
| 4048 |
string, and those that look behind it. An assertion subpattern is |
string, and those that look behind it. An assertion subpattern is |
| 4049 |
matched in the normal way, except that it does not cause the current |
matched in the normal way, except that it does not cause the current |
| 4050 |
matching position to be changed. |
matching position to be changed. |
| 4051 |
|
|
| 4052 |
Assertion subpatterns are not capturing subpatterns, and may not be |
Assertion subpatterns are not capturing subpatterns, and may not be |
| 4053 |
repeated, because it makes no sense to assert the same thing several |
repeated, because it makes no sense to assert the same thing several |
| 4054 |
times. If any kind of assertion contains capturing subpatterns within |
times. If any kind of assertion contains capturing subpatterns within |
| 4055 |
it, these are counted for the purposes of numbering the capturing sub- |
it, these are counted for the purposes of numbering the capturing sub- |
| 4056 |
patterns in the whole pattern. However, substring capturing is carried |
patterns in the whole pattern. However, substring capturing is carried |
| 4057 |
out only for positive assertions, because it does not make sense for |
out only for positive assertions, because it does not make sense for |
| 4058 |
negative assertions. |
negative assertions. |
| 4059 |
|
|
| 4060 |
Lookahead assertions |
Lookahead assertions |
| 4064 |
|
|
| 4065 |
\w+(?=;) |
\w+(?=;) |
| 4066 |
|
|
| 4067 |
matches a word followed by a semicolon, but does not include the semi- |
matches a word followed by a semicolon, but does not include the semi- |
| 4068 |
colon in the match, and |
colon in the match, and |
| 4069 |
|
|
| 4070 |
foo(?!bar) |
foo(?!bar) |
| 4071 |
|
|
| 4072 |
matches any occurrence of "foo" that is not followed by "bar". Note |
matches any occurrence of "foo" that is not followed by "bar". Note |
| 4073 |
that the apparently similar pattern |
that the apparently similar pattern |
| 4074 |
|
|
| 4075 |
(?!foo)bar |
(?!foo)bar |
| 4076 |
|
|
| 4077 |
does not find an occurrence of "bar" that is preceded by something |
does not find an occurrence of "bar" that is preceded by something |
| 4078 |
other than "foo"; it finds any occurrence of "bar" whatsoever, because |
other than "foo"; it finds any occurrence of "bar" whatsoever, because |
| 4079 |
the assertion (?!foo) is always true when the next three characters are |
the assertion (?!foo) is always true when the next three characters are |
| 4080 |
"bar". A lookbehind assertion is needed to achieve the other effect. |
"bar". A lookbehind assertion is needed to achieve the other effect. |
| 4081 |
|
|
| 4082 |
If you want to force a matching failure at some point in a pattern, the |
If you want to force a matching failure at some point in a pattern, the |
| 4083 |
most convenient way to do it is with (?!) because an empty string |
most convenient way to do it is with (?!) because an empty string |
| 4084 |
always matches, so an assertion that requires there not to be an empty |
always matches, so an assertion that requires there not to be an empty |
| 4085 |
string must always fail. |
string must always fail. |
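
The lookahead examples in this subsection behave the same way in Python's re
module, which is used below purely as an illustrative stand-in for PCRE (an
assumption of this sketch). The subject strings are made up for the example.

```python
import re

# Positive lookahead: the semicolon is required but not part of the match.
words = re.findall(r'\w+(?=;)', 'x = y; z;')
print(words)

# Negative lookahead: only the "foo" that is not followed by "bar".
foo_starts = [m.start() for m in re.finditer(r'foo(?!bar)', 'foobar foobaz')]
print(foo_starts)

# (?!foo)bar still finds "bar" after "foo": at the point where "bar"
# begins, the next three characters are "bar", not "foo".
bar_match = re.search(r'(?!foo)bar', 'foobar')
print(bar_match.start())

# (?!) can never succeed, so any pattern path through it fails.
forced_failure = re.search(r'a(?!)', 'aaa')
print(forced_failure)
```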

Lookbehind assertions

Lookbehind assertions start with (?<= for positive assertions and (?<!
for negative assertions. For example,

  (?<!foo)bar

does find an occurrence of "bar" that is not preceded by "foo". The
contents of a lookbehind assertion are restricted such that all the
strings it matches must have a fixed length. However, if there are
several top-level alternatives, they do not all have to have the same
fixed length. Thus

  (?<=bullock|donkey)

is permitted, but

  (?<!dogs?|cats?)

causes an error at compile time. Branches that match different length
strings are permitted only at the top level of a lookbehind assertion.
This is an extension compared with Perl (at least for 5.8), which
requires all branches to match the same length of string. An assertion
such as

  (?<=ab(c|de))

is not permitted, because its single top-level branch can match two
different lengths, but it is acceptable if rewritten to use two
top-level branches:

  (?<=abc|abde)

In some cases, the Perl 5.10 escape sequence \K (see above) can be used
instead of a lookbehind assertion; \K is not restricted to a fixed
length.

The implementation of lookbehind assertions is, for each alternative,
to temporarily move the current position back by the fixed length and
then try to match. If there are insufficient characters before the
current position, the assertion fails.
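
The step-back procedure can be modelled in a few lines of Python. This
is only an illustration, not PCRE's code: the function name lookbehind
is hypothetical, and each alternative is assumed to be a plain literal
string so that its fixed length is simply len(alt).

```python
def lookbehind(subject, pos, alternatives, negative=False):
    """Model a lookbehind at `pos` whose branches are fixed-length literals."""
    matched = False
    for alt in alternatives:
        start = pos - len(alt)          # move back by this branch's length
        if start < 0:                   # too few characters before pos:
            continue                    # this branch cannot match
        if subject[start:pos] == alt:   # try to match from the earlier point
            matched = True
            break
    # A negative assertion succeeds exactly when no branch matched.
    return matched != negative

print(lookbehind("a donkey", 8, ["bullock", "donkey"]))
print(lookbehind("foobar", 3, ["foo"], negative=True))
```

Because every branch carries its own length, top-level alternatives of
different lengths pose no difficulty, which is why (?<=bullock|donkey)
is acceptable.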

PCRE does not allow the \C escape (which matches a single byte in UTF-8
mode) to appear in lookbehind assertions, because it makes it
impossible to calculate the length of the lookbehind. The \X and \R
escapes, which can match different numbers of bytes, are also not
permitted.

Possessive quantifiers can be used in conjunction with lookbehind
assertions to specify efficient matching at the end of the subject
string. Consider a simple pattern such as

  abcd$

when applied to a long string that does not match. Because matching
proceeds from left to right, PCRE will look for each "a" in the subject
and then see if what follows matches the rest of the pattern. If the
pattern is specified as

  ^.*abcd$

the initial .* matches the entire string at first, but when this fails
(because there is no following "a"), it backtracks to match all but the
last character, then all but the last two characters, and so on. Once
again the search for "a" covers the entire string, from right to left,
so we are no better off. However, if the pattern is written as

  ^.*+(?<=abcd)

there can be no backtracking for the .*+ item; it can match only the
entire string. The subsequent lookbehind assertion does a single test
on the last four characters. If it fails, the match fails immediately.
For long strings, this approach makes a significant difference to the
processing time.
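
The cost difference can be pictured with a rough model in Python. It
counts match attempts rather than running a regex engine, so it is a
sketch of the argument above and not a measurement of PCRE; both
function names are hypothetical and the subject is assumed to contain
no newlines.

```python
def scan_for_suffix(subject, word="abcd"):
    """Rough model of abcd$: try every position where word[0] occurs."""
    attempts = 0
    for start, ch in enumerate(subject):
        if ch != word[0]:
            continue
        attempts += 1
        if subject[start:] == word:
            return True, attempts
    return False, attempts

def single_end_test(subject, word="abcd"):
    """Rough model of ^.*+(?<=abcd): one comparison at the very end."""
    return subject.endswith(word), 1

subject = "a" * 1000 + "x"
print(scan_for_suffix(subject))   # one attempt per "a" before giving up
print(single_end_test(subject))   # a single test
```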

Using multiple assertions

Several assertions (of any sort) may occur in succession. For example,

  (?<=\d{3})(?<!999)foo

matches "foo" preceded by three digits that are not "999". Notice that
each of the assertions is applied independently at the same point in
the subject string. First there is a check that the previous three
characters are all digits, and then there is a check that the same
three characters are not "999". This pattern does not match "foo"
preceded by six characters, the first three of which are digits and the
last three of which are not "999". For example, it doesn't match
"123abcfoo". A pattern to do that is

  (?<=\d{3}...)(?<!999)foo

This time the first assertion looks at the preceding six characters,
checking that the first three are digits, and then the second assertion
checks that the preceding three characters are not "999".
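
Both patterns can be tried out in Python's re module, which accepts
these fixed-length lookbehinds unchanged. Using re instead of PCRE is
an assumption of this sketch, and the variable names are illustrative.

```python
import re

# Both assertions look at the same three characters before "foo".
same_three = re.compile(r"(?<=\d{3})(?<!999)foo")

# The first assertion looks six characters back, the second only three.
six_back = re.compile(r"(?<=\d{3}...)(?<!999)foo")

for text in ("123foo", "999foo", "123abcfoo", "123999foo"):
    print(text, bool(same_three.search(text)), bool(six_back.search(text)))
```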

Assertions can be nested in any combination. For example,

  (?<=(?<!foo)bar)baz

matches an occurrence of "baz" that is preceded by "bar" which in turn
is not preceded by "foo", while

  (?<=\d{3}(?!999)...)foo

is another pattern that matches "foo" preceded by three digits and any
three characters that are not "999".
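
As a quick check, the nested forms also compile in Python's re module,
because the inner assertions are zero-width and leave the lookbehind
length fixed. The engine choice and the variable names are assumptions
of this sketch.

```python
import re

# A lookbehind that itself contains a negative lookbehind.
nested = re.compile(r"(?<=(?<!foo)bar)baz")

# A lookbehind that contains a negative lookahead.
mixed = re.compile(r"(?<=\d{3}(?!999)...)foo")

print(bool(nested.search("xxxbarbaz")), bool(nested.search("foobarbaz")))
print(bool(mixed.search("123abcfoo")), bool(mixed.search("123999foo")))
```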


CONDITIONAL SUBPATTERNS

It is possible to cause the matching process to obey a subpattern
conditionally or to choose between two alternative subpatterns,
depending on the result of an assertion, or whether a previous
capturing subpattern matched or not. The two possible forms of
conditional subpattern are

  (?(condition)yes-pattern)
  (?(condition)yes-pattern|no-pattern)

If the condition is satisfied, the yes-pattern is used; otherwise the
no-pattern (if present) is used. If there are more than two
alternatives in the subpattern, a compile-time error occurs.

There are four kinds of condition: references to subpatterns,
references to recursion, a pseudo-condition called DEFINE, and
assertions.

Checking for a used subpattern by number

If the text between the parentheses consists of a sequence of digits,
the condition is true if the capturing subpattern of that number has
previously matched. An alternative notation is to precede the digits
with a plus or minus sign. In this case, the subpattern number is
relative rather than absolute. The most recently opened parentheses can
be referenced by (?(-1), the next most recent by (?(-2), and so on. In
looping constructs it can also make sense to refer to subsequent groups
with constructs such as (?(+2).

Consider the following pattern, which contains non-significant white
space to make it more readable (assume the PCRE_EXTENDED option) and to
divide it into three parts for ease of discussion:

  ( \( )?    [^()]+    (?(1) \) )

The first part matches an optional opening parenthesis, and if that
character is present, sets it as the first captured substring. The
second part matches one or more characters that are not parentheses.
The third part is a conditional subpattern that tests whether the first
set of parentheses matched or not. If they did, that is, if the subject
started with an opening parenthesis, the condition is true, and so the
yes-pattern is executed and a closing parenthesis is required.
Otherwise, since no-pattern is not present, the subpattern matches
nothing. In other words, this pattern matches a sequence of
non-parentheses, optionally enclosed in parentheses.
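
The numbered form of this conditional is also understood by Python's re
module, so the behaviour can be sketched there; re.VERBOSE stands in
for PCRE_EXTENDED. The use of Python rather than PCRE is an assumption
of the example, and Python does not offer the relative (?(-1) notation.

```python
import re

# Same three parts as above: optional "(", the body, conditional ")".
optional_parens = re.compile(r"( \( )?  [^()]+  (?(1) \) )", re.VERBOSE)

for text in ("abc", "(abc)", "(abc", "abc)"):
    print(text, bool(optional_parens.fullmatch(text)))
```

fullmatch is used so that a stray parenthesis at either end makes the
whole subject fail instead of being skipped.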

If you were embedding this pattern in a larger one, you could use a
relative reference:

  ...other stuff... ( \( )?    [^()]+    (?(-1) \) ) ...

This makes the fragment independent of the parentheses in the larger
pattern.

Checking for a used subpattern by name

Perl uses the syntax (?(<name>)...) or (?('name')...) to test for a
used subpattern by name. For compatibility with earlier versions of
PCRE, which had this facility before Perl, the syntax (?(name)...) is
also recognized. However, there is a possible ambiguity with this
syntax, because subpattern names may consist entirely of digits. PCRE
looks first for a named subpattern; if it cannot find one and the name
consists entirely of digits, PCRE looks for a subpattern of that
number, which must be greater than zero. Using subpattern names that
consist entirely of digits is not recommended.

Rewriting the above example to use a named subpattern gives this:

  (?<OPEN> \( )?    [^()]+    (?(<OPEN>) \) )
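
A named conditional can be sketched in Python's re module as well.
Python spells the group (?P<name>...) and the test (?(name)...), so the
syntax differs from the Perl form; the group name OPEN is chosen here
purely for illustration, and using re rather than PCRE is an
assumption.

```python
import re

# The optional "(" is captured under a name, then tested by that name.
named = re.compile(r"(?P<OPEN> \( )?  [^()]+  (?(OPEN) \) )", re.VERBOSE)

match = named.fullmatch("(abc)")
print(match.group("OPEN"))
print(named.fullmatch("abc").group("OPEN"))
```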

Checking for pattern recursion

If the condition is the string (R), and there is no subpattern with the
name R, the condition is true if a recursive call to the whole pattern
or any subpattern has been made. If digits or a name preceded by
ampersand follow the letter R, for example:

  (?(R3)...) or (?(R&name)...)

the condition is true if the most recent recursion is into the
subpattern whose number or name is given. This condition does not check
the entire recursion stack.

At "top level", all these recursion test conditions are false.
Recursive patterns are described below.

Defining subpatterns for use by reference only

If the condition is the string (DEFINE), and there is no subpattern
with the name DEFINE, the condition is always false. In this case,
there may be only one alternative in the subpattern. It is always
skipped if control reaches this point in the pattern; the idea of
DEFINE is that it can be used to define "subroutines" that can be
referenced from elsewhere. (The use of "subroutines" is described
below.) For example, a pattern to match an IPv4 address could be
written like this (ignore whitespace and line breaks):

  (?(DEFINE) (?<byte> 2[0-4]\d | 25[0-5] | 1\d\d | [1-9]?\d) )
  \b (?&byte) (\.(?&byte)){3} \b

The first part of the pattern is a DEFINE group inside which another
group named "byte" is defined. This matches an individual component of
an IPv4 address (a number less than 256). When matching takes place,
this part of the pattern is skipped because DEFINE acts like a false
condition.

The rest of the pattern uses references to the named group to match the
four dot-separated components of an IPv4 address, insisting on a word
boundary at each end.
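
Python's re module has no DEFINE groups or subroutine calls, so the
sketch below swaps in a different technique: the "byte" fragment is
defined once as a string and interpolated wherever (?&byte) appears.
The names BYTE and IPV4 are hypothetical, and re.ASCII is added so that
\d means 0-9 only.

```python
import re

# Defined once, reused four times: the role DEFINE plays in the pattern.
BYTE = r"(?:2[0-4]\d|25[0-5]|1\d\d|[1-9]?\d)"
IPV4 = re.compile(rf"\b{BYTE}(?:\.{BYTE}){{3}}\b", re.ASCII)

print(bool(IPV4.fullmatch("192.168.0.1")))
print(bool(IPV4.fullmatch("256.1.1.1")))
print(IPV4.search("host 10.0.0.255 up").group())
```

The design goal is the same in both forms: the sub-expression for one
component is written in a single place.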

Assertion conditions

If the condition is not in any of the above formats, it must be an
assertion. This may be a positive or negative lookahead or lookbehind
assertion. Consider this pattern, again containing non-significant
white space, and with the two alternatives on the second line:

  (?(?=[^a-z]*[a-z])
  \d{2}-[a-z]{3}-\d{2} | \d{2}-\d{2}-\d{2} )

The condition is a positive lookahead assertion that matches an
optional sequence of non-letters followed by a letter. In other words,
it tests for the presence of at least one letter in the subject. If a
letter is found, the subject is matched against the first alternative;
otherwise it is matched against the second. This pattern matches
strings in one of the two forms dd-aaa-dd or dd-dd-dd, where aaa are
letters and dd are digits.
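
Python's re module does not accept an assertion as a condition, so this
sketch emulates the construct in two steps: test the lookahead first,
then apply whichever alternative it selects. All names are
hypothetical, and the emulation is an assumption about how to mirror
the behaviour, not a description of PCRE internals.

```python
import re

HAS_LETTER = re.compile(r"(?=[^a-z]*[a-z])")     # the condition
WITH_LETTERS = re.compile(r"\d{2}-[a-z]{3}-\d{2}")  # yes-pattern
DIGITS_ONLY = re.compile(r"\d{2}-\d{2}-\d{2}")      # no-pattern

def match_date_like(subject, pos=0):
    """Choose the branch by the assertion; there is no fallback."""
    branch = WITH_LETTERS if HAS_LETTER.match(subject, pos) else DIGITS_ONLY
    return branch.match(subject, pos)

for text in ("12-abc-34", "12-34-56", "12-ab-34"):
    print(text, bool(match_date_like(text)))
```

Once the condition has picked a branch, failure of that branch is
final; the other alternative is not tried.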


COMMENTS

The sequence (?# marks the start of a comment that continues up to the
next closing parenthesis. Nested parentheses are not permitted. The
characters that make up a comment play no part in the pattern matching
at all.

If the PCRE_EXTENDED option is set, an unescaped # character outside a
character class introduces a comment that continues to immediately
after the next newline in the pattern.
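
Both comment styles exist in Python's re module too, with re.VERBOSE
playing the part of PCRE_EXTENDED. The sketch assumes that engine, and
the pattern itself is invented for illustration.

```python
import re

# Inline comment: ignored up to the next closing parenthesis.
inline = re.compile(r"\d+(?#the digits)-\w+")

# Extended mode: an unescaped # starts a comment that runs to end of line.
extended = re.compile(r"""
    \d+   # one or more digits
    -     # a literal hyphen
    \w+   # a word
""", re.VERBOSE)

print(bool(inline.fullmatch("42-abc")), bool(extended.fullmatch("42-abc")))
```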


RECURSIVE PATTERNS

Consider the problem of matching a string in parentheses, allowing for
unlimited nested parentheses. Without the use of recursion, the best
that can be done is to use a pattern that matches up to some fixed
depth of nesting. It is not possible to handle an arbitrary nesting
depth.

For some time, Perl has provided a facility that allows regular
expressions to recurse (amongst other things). It does this by
interpolating Perl code in the expression at run time, and the code can
refer to the expression itself. A Perl pattern using code interpolation
to solve the parentheses problem can be created like this:

  $re = qr{\( (?: (?>[^()]+) | (?p{$re}) )* \)}x;

The (?p{...}) item interpolates Perl code at run time, and in this case
refers recursively to the pattern in which it appears.

Obviously, PCRE cannot support the interpolation of Perl code. Instead,
it supports special syntax for recursion of the entire pattern, and
also for individual subpattern recursion. After its introduction in
PCRE and Python, this kind of recursion was introduced into Perl at
release 5.10.

A special item that consists of (? followed by a number greater than
zero and a closing parenthesis is a recursive call of the subpattern of
the given number, provided that it occurs inside that subpattern. (If
not, it is a "subroutine" call, which is described in the next
section.) The special item (?R) or (?0) is a recursive call of the
entire regular expression.

In PCRE (like Python, but unlike Perl), a recursive subpattern call is
always treated as an atomic group. That is, once it has matched some of
the subject string, it is never re-entered, even if it contains untried
alternatives and there is a subsequent matching failure.

This PCRE pattern solves the nested parentheses problem (assume the
PCRE_EXTENDED option is set so that white space is ignored):

  \( ( (?>[^()]+) | (?R) )* \)

First it matches an opening parenthesis. Then it matches any number of
substrings which can either be a sequence of non-parentheses, or a
recursive match of the pattern itself (that is, a correctly
parenthesized substring). Finally there is a closing parenthesis.
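
The structure of the recursive pattern can be mirrored by a small
hand-written matcher. Python's standard re module has no recursion
syntax, so this is a plain function rather than a regex; the name
match_parens is hypothetical, and the function tests for a match that
starts exactly at pos.

```python
def match_parens(s, pos=0):
    """Return the end index of a balanced group starting at pos, or None."""
    if pos >= len(s) or s[pos] != "(":
        return None                      # \(
    pos += 1
    while pos < len(s) and s[pos] != ")":
        if s[pos] == "(":
            end = match_parens(s, pos)   # plays the role of (?R)
            if end is None:
                return None
            pos = end
        else:
            # Consume a whole run of non-parentheses, like (?>[^()]+).
            while pos < len(s) and s[pos] not in "()":
                pos += 1
    if pos < len(s):                     # the closing \)
        return pos + 1
    return None

print(match_parens("(ab(cd)ef)"))
print(match_parens("(" + "a" * 53 + "()"))
```

Each run of non-parentheses is consumed once and never split up again,
which is the effect the atomic group has in the PCRE pattern.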
| 4383 |
|
|
| 4384 |
If this were part of a larger pattern, you would not want to recurse |
If this were part of a larger pattern, you would not want to recurse |
| 4385 |
the entire pattern, so instead you could use this: |
the entire pattern, so instead you could use this: |
| 4386 |
|
|
| 4387 |
( \( ( (?>[^()]+) | (?1) )* \) ) |
( \( ( (?>[^()]+) | (?1) )* \) ) |
| 4388 |
|
|
| 4389 |
We have put the pattern into parentheses, and caused the recursion to |
We have put the pattern into parentheses, and caused the recursion to |
| 4390 |
refer to them instead of the whole pattern. |
refer to them instead of the whole pattern. |
| 4391 |
|
|
| 4392 |
In a larger pattern, keeping track of parenthesis numbers can be |
In a larger pattern, keeping track of parenthesis numbers can be |
| 4393 |
tricky. This is made easier by the use of relative references. (A Perl |
tricky. This is made easier by the use of relative references. (A Perl |
| 4394 |
5.10 feature.) Instead of (?1) in the pattern above you can write |
5.10 feature.) Instead of (?1) in the pattern above you can write |
| 4395 |
(?-2) to refer to the second most recently opened parentheses preceding |
(?-2) to refer to the second most recently opened parentheses preceding |
| 4396 |
the recursion. In other words, a negative number counts capturing |
the recursion. In other words, a negative number counts capturing |
| 4397 |
parentheses leftwards from the point at which it is encountered. |
parentheses leftwards from the point at which it is encountered. |
| 4398 |
|
|
| 4399 |
It is also possible to refer to subsequently opened parentheses, by |
It is also possible to refer to subsequently opened parentheses, by |
| 4400 |
writing references such as (?+2). However, these cannot be recursive |
writing references such as (?+2). However, these cannot be recursive |
| 4401 |
because the reference is not inside the parentheses that are refer- |
because the reference is not inside the parentheses that are refer- |
| 4402 |
enced. They are always "subroutine" calls, as described in the next |
enced. They are always "subroutine" calls, as described in the next |
| 4403 |
section. |
section. |
| 4404 |
|
|
| 4405 |
An alternative approach is to use named parentheses instead. The Perl |
An alternative approach is to use named parentheses instead. The Perl |
| 4406 |
syntax for this is (?&name); PCRE's earlier syntax (?P>name) is also |
syntax for this is (?&name); PCRE's earlier syntax (?P>name) is also |
| 4407 |
supported. We could rewrite the above example as follows: |
supported. We could rewrite the above example as follows: |
| 4408 |
|
|
| 4409 |
(?<pn> \( ( (?>[^()]+) | (?&pn) )* \) ) |
(?<pn> \( ( (?>[^()]+) | (?&pn) )* \) ) |
| 4410 |
|
|
| 4411 |
If there is more than one subpattern with the same name, the earliest |
If there is more than one subpattern with the same name, the earliest |
| 4412 |
one is used. |
one is used. |
| 4413 |
|
|
| 4414 |
This particular example pattern that we have been looking at contains |
This particular example pattern that we have been looking at contains |
| 4415 |
nested unlimited repeats, and so the use of atomic grouping for match- |
nested unlimited repeats, and so the use of atomic grouping for match- |
| 4416 |
ing strings of non-parentheses is important when applying the pattern |
ing strings of non-parentheses is important when applying the pattern |
| 4417 |
to strings that do not match. For example, when this pattern is applied |
to strings that do not match. For example, when this pattern is applied |
| 4418 |
to |
to |
| 4419 |
|
|
| 4420 |
(aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa() |
(aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa() |
| 4421 |
|
|
| 4422 |
it yields "no match" quickly. However, if atomic grouping is not used, |
it yields "no match" quickly. However, if atomic grouping is not used, |
| 4423 |
the match runs for a very long time indeed because there are so many
different ways the + and * repeats can carve up the subject, and all
have to be tested before failure can be reported.

At the end of a match, the values set for any capturing subpatterns are
those from the outermost level of the recursion at which the subpattern
value is set. If you want to obtain intermediate values, a callout
function can be used (see below and the pcrecallout documentation). If
the pattern above is matched against

  (ab(cd)ef)

the value for the capturing parentheses is "ef", which is the last
value taken on at the top level. If additional parentheses are added,
giving

  \( ( ( (?>[^()]+) | (?R) )* ) \)
     ^                        ^
     ^                        ^

the string they capture is "ab(cd)ef", the contents of the top level
parentheses. If there are more than 15 capturing parentheses in a
pattern, PCRE has to obtain extra memory to store data during a
recursion, which it does by using pcre_malloc, freeing it via pcre_free
afterwards. If no memory can be obtained, the match fails with the
PCRE_ERROR_NOMEMORY error.
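
The shape of the recursive parentheses pattern can be sketched as an
ordinary recursive function. This is only an analogy written in Python,
not how PCRE is implemented: the function names are invented for the
illustration, there is no backtracking, and it reports only the contents
of the top level parentheses (the "ab(cd)ef" case above).

```python
def match_parens(s, pos=0):
    """Return the index just past a balanced group starting at s[pos],
    or None if there is no balanced group there."""
    if pos >= len(s) or s[pos] != "(":
        return None
    pos += 1
    while pos < len(s) and s[pos] != ")":
        if s[pos] == "(":
            # Plays the role of (?R): match a nested group recursively.
            pos = match_parens(s, pos)
            if pos is None:
                return None
        else:
            # Plays the role of (?>[^()]+): consume a run of
            # non-parenthesis characters without giving any back.
            while pos < len(s) and s[pos] not in "()":
                pos += 1
    if pos < len(s) and s[pos] == ")":
        return pos + 1
    return None


def top_level_contents(s):
    """Return what the added outer capturing parentheses would hold."""
    end = match_parens(s)
    return s[1:end - 1] if end is not None else None


print(top_level_contents("(ab(cd)ef)"))
print(match_parens("(ab(cd"))
```

The first call prints the contents of the outermost parentheses; the
second subject is unbalanced, so no match is reported.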

Do not confuse the (?R) item with the condition (R), which tests for
recursion. Consider this pattern, which matches text in angle brackets,
allowing for arbitrary nesting. Only digits are allowed in nested
brackets (that is, when recursing), whereas any characters are
permitted at the outer level.

  < (?: (?(R) \d++ | [^<>]*+) | (?R)) * >

In this pattern, (?(R) is the start of a conditional subpattern, with
two different alternatives for the recursive and non-recursive cases.
The (?R) item is the actual recursive call.


SUBPATTERNS AS SUBROUTINES

If the syntax for a recursive subpattern reference (either by number or
by name) is used outside the parentheses to which it refers, it
operates like a subroutine in a programming language. The "called"
subpattern may be defined before or after the reference. A numbered
reference can be absolute or relative, as in these examples:
| 4470 |
|
|
| 4476 |
|
|
  (sens|respons)e and \1ibility

matches "sense and sensibility" and "response and responsibility", but
not "sense and responsibility". If instead the pattern

  (sens|respons)e and (?1)ibility

is used, it does match "sense and responsibility" as well as the other
two strings. Another example is given in the discussion of DEFINE
above.
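
The difference between the two patterns can be tried out with a small
sketch. It uses Python's re module, which is not PCRE and has no (?1)
item, so the subroutine call is imitated by writing the group's text out
a second time; that substitution is an assumption made for this
illustration and is only equivalent for simple, non-recursive cases
such as this one.

```python
import re

# A back reference must match the same text the group matched.
backref = re.compile(r"(sens|respons)e and \1ibility")

# Stand-in for (sens|respons)e and (?1)ibility: the subpattern itself
# is applied again, so either alternative may match the second time.
reuse = re.compile(r"(sens|respons)e and (?:sens|respons)ibility")

subjects = [
    "sense and sensibility",
    "response and responsibility",
    "sense and responsibility",
]

for s in subjects:
    print(s, bool(backref.fullmatch(s)), bool(reuse.fullmatch(s)))
```

Only the last subject distinguishes the two: the back reference rejects
it, while reusing the subpattern accepts it.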

Like recursive subpatterns, a "subroutine" call is always treated as an
atomic group. That is, once it has matched some of the subject string,
it is never re-entered, even if it contains untried alternatives and
there is a subsequent matching failure.

When a subpattern is used as a subroutine, processing options such as
case-independence are fixed when the subpattern is defined. They cannot
be changed for different calls. For example, consider this pattern:

  (abc)(?i:(?-1))

It matches "abcabc". It does not match "abcABC" because the change of
processing option does not affect the called subpattern.


CALLOUTS

Perl has a feature whereby using the sequence (?{...}) causes arbitrary
Perl code to be obeyed in the middle of matching a regular expression.
This makes it possible, amongst other things, to extract different
substrings that match the same pair of parentheses when there is a
repetition.

PCRE provides a similar feature, but of course it cannot obey arbitrary
Perl code. The feature is called "callout". The caller of PCRE provides
an external function by putting its entry point in the global variable
pcre_callout. By default, this variable contains NULL, which disables
all calling out.

Within a regular expression, (?C) indicates the points at which the
external function is to be called. If you want to identify different
callout points, you can put a number less than 256 after the letter C.
The default value is zero. For example, this pattern has two callout
points:

  (?C1)abc(?C2)def

If the PCRE_AUTO_CALLOUT flag is passed to pcre_compile(), callouts are
automatically installed before each item in the pattern. They are all
numbered 255.

During matching, when PCRE reaches a callout point (and pcre_callout is
set), the external function is called. It is provided with the number
of the callout, the position in the pattern, and, optionally, one item
of data originally supplied by the caller of pcre_exec(). The callout
function may cause matching to proceed, to backtrack, or to fail
altogether. A complete description of the interface to the callout
function is given in the pcrecallout documentation.
| 4536 |
|
|
| 4549 |
|
|
| 4550 |
REVISION |
REVISION |
| 4551 |
|
|
| 4552 |
Last updated: 19 June 2007 |
Last updated: 06 August 2007 |
| 4553 |
|
Copyright (c) 1997-2007 University of Cambridge. |
| 4554 |
|
------------------------------------------------------------------------------ |
| 4555 |
|
|
| 4556 |
|
|
PCRESYNTAX(3)                                                    PCRESYNTAX(3)


NAME
PCRE - Perl-compatible regular expressions


PCRE REGULAR EXPRESSION SYNTAX SUMMARY

The full syntax and semantics of the regular expressions that are
supported by PCRE are described in the pcrepattern documentation. This
document contains just a quick-reference summary of the syntax.


QUOTING

  \x         where x is non-alphanumeric is a literal x
  \Q...\E    treat enclosed characters as literal


CHARACTERS

  \a         alarm, that is, the BEL character (hex 07)
  \cx        "control-x", where x is any character
  \e         escape (hex 1B)
  \f         formfeed (hex 0C)
  \n         newline (hex 0A)
  \r         carriage return (hex 0D)
  \t         tab (hex 09)
  \ddd       character with octal code ddd, or backreference
  \xhh       character with hex code hh
  \x{hhh..}  character with hex code hhh..


CHARACTER TYPES

  .          any character except newline;
               in dotall mode, any character whatsoever
  \C         one byte, even in UTF-8 mode (best avoided)
  \d         a decimal digit
  \D         a character that is not a decimal digit
  \h         a horizontal whitespace character
  \H         a character that is not a horizontal whitespace character
  \p{xx}     a character with the xx property
  \P{xx}     a character without the xx property
  \R         a newline sequence
  \s         a whitespace character
  \S         a character that is not a whitespace character
  \v         a vertical whitespace character
  \V         a character that is not a vertical whitespace character
  \w         a "word" character
  \W         a "non-word" character
  \X         an extended Unicode sequence

In PCRE, \d, \D, \s, \S, \w, and \W recognize only ASCII characters.


GENERAL CATEGORY PROPERTY CODES FOR \p and \P

  C          Other
  Cc         Control
  Cf         Format
  Cn         Unassigned
  Co         Private use
  Cs         Surrogate

  L          Letter
  Ll         Lower case letter
  Lm         Modifier letter
  Lo         Other letter
  Lt         Title case letter
  Lu         Upper case letter
  L&         Ll, Lu, or Lt

  M          Mark
  Mc         Spacing mark
  Me         Enclosing mark
  Mn         Non-spacing mark

  N          Number
  Nd         Decimal number
  Nl         Letter number
  No         Other number

  P          Punctuation
  Pc         Connector punctuation
  Pd         Dash punctuation
  Pe         Close punctuation
  Pf         Final punctuation
  Pi         Initial punctuation
  Po         Other punctuation
  Ps         Open punctuation

  S          Symbol
  Sc         Currency symbol
  Sk         Modifier symbol
  Sm         Mathematical symbol
  So         Other symbol

  Z          Separator
  Zl         Line separator
  Zp         Paragraph separator
  Zs         Space separator


SCRIPT NAMES FOR \p AND \P

Arabic, Armenian, Balinese, Bengali, Bopomofo, Braille, Buginese,
Buhid, Canadian_Aboriginal, Cherokee, Common, Coptic, Cuneiform,
Cypriot, Cyrillic, Deseret, Devanagari, Ethiopic, Georgian, Glagolitic,
Gothic, Greek, Gujarati, Gurmukhi, Han, Hangul, Hanunoo, Hebrew,
Hiragana, Inherited, Kannada, Katakana, Kharoshthi, Khmer, Lao, Latin,
Limbu, Linear_B, Malayalam, Mongolian, Myanmar, New_Tai_Lue, Nko,
Ogham, Old_Italic, Old_Persian, Oriya, Osmanya, Phags_Pa, Phoenician,
Runic, Shavian, Sinhala, Syloti_Nagri, Syriac, Tagalog, Tagbanwa,
Tai_Le, Tamil, Telugu, Thaana, Thai, Tibetan, Tifinagh, Ugaritic, Yi.


CHARACTER CLASSES

  [...]       positive character class
  [^...]      negative character class
  [x-y]       range (can be used for hex characters)
  [[:xxx:]]   positive POSIX named set
  [[:^xxx:]]  negative POSIX named set

  alnum       alphanumeric
  alpha       alphabetic
  ascii       0-127
  blank       space or tab
  cntrl       control character
  digit       decimal digit
  graph       printing, excluding space
  lower       lower case letter
  print       printing, including space
  punct       printing, excluding alphanumeric
  space       whitespace
  upper       upper case letter
  word        same as \w
  xdigit      hexadecimal digit

In PCRE, POSIX character set names recognize only ASCII characters. You
can use \Q...\E inside a character class.


QUANTIFIERS

  ?          0 or 1, greedy
  ?+         0 or 1, possessive
  ??         0 or 1, lazy
  *          0 or more, greedy
  *+         0 or more, possessive
  *?         0 or more, lazy
  +          1 or more, greedy
  ++         1 or more, possessive
  +?         1 or more, lazy
  {n}        exactly n
  {n,m}      at least n, no more than m, greedy
  {n,m}+     at least n, no more than m, possessive
  {n,m}?     at least n, no more than m, lazy
  {n,}       n or more, greedy
  {n,}+      n or more, possessive
  {n,}?      n or more, lazy
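
The difference between the greedy and lazy forms can be seen in a short
sketch. It is written with Python's re module rather than PCRE, on the
assumption that the two engines treat these particular quantifiers the
same way; the possessive forms are left out because older Python
versions do not accept them.

```python
import re

subject = "<b>bold</b>"

# Greedy: .+ takes as much as it can, then gives back just enough
# for the final > to match.
greedy = re.match(r"<.+>", subject).group()

# Lazy: .+? takes as little as it can, stopping at the first >.
lazy = re.match(r"<.+?>", subject).group()

print(greedy)  # <b>bold</b>
print(lazy)    # <b>
```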


ANCHORS AND SIMPLE ASSERTIONS

  \b         word boundary
  \B         not a word boundary
  ^          start of subject
               also after internal newline in multiline mode
  \A         start of subject
  $          end of subject
               also before newline at end of subject
               also before internal newline in multiline mode
  \Z         end of subject
               also before newline at end of subject
  \z         end of subject
  \G         first matching position in subject
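
The effect of multiline mode on ^ and $ can be sketched as follows. The
example uses Python's re module, not PCRE, and assumes the two agree on
^, $ and \A. Python's \Z means something different from the \Z in the
table (it corresponds to \z), so it is not used here.

```python
import re

subject = "one\ntwo\n"

# Without multiline mode, ^ matches only at the start of the subject.
starts_default = re.findall(r"^\w+", subject)
# In multiline mode, ^ also matches after each internal newline.
starts_multiline = re.findall(r"^\w+", subject, re.MULTILINE)

# $ matches before the newline at the very end of the subject...
at_end = re.search(r"two$", subject) is not None
# ...but before an internal newline only in multiline mode.
internal_default = re.search(r"one$", subject) is not None
internal_multiline = re.search(r"one$", subject, re.MULTILINE) is not None

# \A is unaffected by multiline mode.
a_multiline = re.search(r"\Atwo", subject, re.MULTILINE) is not None

print(starts_default, starts_multiline)
print(at_end, internal_default, internal_multiline, a_multiline)
```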


MATCH POINT RESET

  \K         reset start of match


ALTERNATION

  expr|expr|expr...


CAPTURING

  (...)          capturing group
  (?<name>...)   named capturing group (Perl)
  (?'name'...)   named capturing group (Perl)
  (?P<name>...)  named capturing group (Python)
  (?:...)        non-capturing group
  (?|...)        non-capturing group; reset group numbers for
                   capturing groups in each alternative


ATOMIC GROUPS

  (?>...)        atomic, non-capturing group


COMMENT

  (?#....)       comment (not nestable)


OPTION SETTING

  (?i)           caseless
  (?J)           allow duplicate names
  (?m)           multiline
  (?s)           single line (dotall)
  (?U)           default ungreedy (lazy)
  (?x)           extended (ignore white space)
  (?-...)        unset option(s)


LOOKAHEAD AND LOOKBEHIND ASSERTIONS

  (?=...)        positive look ahead
  (?!...)        negative look ahead
  (?<=...)       positive look behind
  (?<!...)       negative look behind

Each top-level branch of a look behind must be of a fixed length.
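
The four assertions can be illustrated with a brief sketch. It uses
Python's re module as a stand-in for PCRE, which is an assumption of
this example. The two engines differ in detail: Python requires the
whole look behind to have one fixed width, whereas the rule above is
stated per top-level branch. The helper name compiles() is invented for
the illustration.

```python
import re

# Look ahead: "foo" is matched only if "bar" does (or does not) follow;
# the asserted text is not part of the match.
ahead = re.search(r"foo(?=bar)", "foobar").group()
not_ahead_bar = re.search(r"foo(?!bar)", "foobar")
not_ahead_baz = re.search(r"foo(?!bar)", "foobaz").group()

# Look behind: digits preceded (or not preceded) by a dollar sign.
behind = re.search(r"(?<=\$)\d+", "cost: $42").group()
not_behind = re.search(r"(?<!\$)\b\d+", "$42 and 17").group()


def compiles(pattern):
    """Report whether Python's re accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


print(ahead, not_ahead_bar, not_ahead_baz, behind, not_behind)
# A look behind of variable length is rejected at compile time.
print(compiles(r"(?<=ab)c"), compiles(r"(?<=a+)c"))
```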


BACKREFERENCES

  \n             reference by number (can be ambiguous)
  \gn            reference by number
  \g{n}          reference by number
  \g{-n}         relative reference by number
  \k<name>       reference by name (Perl)
  \k'name'       reference by name (Perl)
  \g{name}       reference by name (Perl)
  \k{name}       reference by name (.NET)
  (?P=name)      reference by name (Python)
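
As a small illustration of a named back reference, the sketch below
uses the Python-style forms (?P<name>...) and (?P=name) from the tables
above. It is run with Python's own re module rather than PCRE; the
group name "quote" and the sample subjects are made up for the example.

```python
import re

# The closing quote must be the same character as the opening one.
quoted = re.compile(r"""(?P<quote>['"]).*?(?P=quote)""")

matches = [m.group(0) for m in quoted.finditer("""say "hi" or 'bye'""")]
mismatch = quoted.search("""say "oops'""")

print(matches)
print(mismatch)
```

The second subject opens with a double quote and closes with a single
one, so the back reference cannot be satisfied and no match is found.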


SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)

  (?R)           recurse whole pattern
  (?n)           call subpattern by absolute number
  (?+n)          call subpattern by relative number
  (?-n)          call subpattern by relative number
  (?&name)       call subpattern by name (Perl)
  (?P>name)      call subpattern by name (Python)


CONDITIONAL PATTERNS

  (?(condition)yes-pattern)
  (?(condition)yes-pattern|no-pattern)

  (?(n)...         absolute reference condition
  (?(+n)...        relative reference condition
  (?(-n)...        relative reference condition
  (?(<name>)...    named reference condition (Perl)
  (?('name')...    named reference condition (Perl)
  (?(name)...      named reference condition (PCRE)
  (?(R)...         overall recursion condition
  (?(Rn)...        specific group recursion condition
  (?(R&name)...    specific recursion condition
  (?(DEFINE)...    define subpattern for reference
  (?(assert)...    assertion condition
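
The absolute reference condition can be tried out with a short sketch.
It uses Python's re module, which accepts the (?(n)yes-pattern) form;
the other conditions in the table, such as (R) and (DEFINE), are not
available there. The pattern and subjects are invented for the
illustration.

```python
import re

# A closing > is required only if group 1 matched an opening <.
tag = re.compile(r"^(<)?\w+(?(1)>)$")

results = {s: bool(tag.match(s)) for s in ["<tag>", "tag", "<tag", "tag>"]}
print(results)
```

Only the first two subjects match: the brackets must either both be
present or both be absent.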


CALLOUTS

  (?C)           callout
  (?Cn)          callout with data n


SEE ALSO

pcrepattern(3), pcreapi(3), pcrecallout(3), pcrematching(3), pcre(3).


AUTHOR

Philip Hazel
University Computing Service
Cambridge CB2 3QH, England.


REVISION

Last updated: 06 August 2007
Copyright (c) 1997-2007 University of Cambridge.
------------------------------------------------------------------------------
| 4854 |
|
|