| 2911 |
The newline convention does not affect what the \R escape sequence |
The newline convention does not affect what the \R escape sequence |
| 2912 |
matches. By default, this is any Unicode newline sequence, for Perl |
matches. By default, this is any Unicode newline sequence, for Perl |
| 2913 |
compatibility. However, this can be changed; see the description of \R |
compatibility. However, this can be changed; see the description of \R |
| 2914 |
in the section entitled "Newline sequences" below. |
in the section entitled "Newline sequences" below. A change of \R set- |
| 2915 |
|
ting can be combined with a change of newline convention. |
| 2916 |
|
|
| 2917 |
|
|
| 2918 |
CHARACTERS AND METACHARACTERS |
CHARACTERS AND METACHARACTERS |
| 2919 |
|
|
| 2920 |
A regular expression is a pattern that is matched against a subject |
A regular expression is a pattern that is matched against a subject |
| 2921 |
string from left to right. Most characters stand for themselves in a |
string from left to right. Most characters stand for themselves in a |
| 2922 |
pattern, and match the corresponding characters in the subject. As a |
pattern, and match the corresponding characters in the subject. As a |
| 2923 |
trivial example, the pattern |
trivial example, the pattern |
| 2924 |
|
|
| 2925 |
The quick brown fox |
The quick brown fox |
| 2926 |
|
|
| 2927 |
matches a portion of a subject string that is identical to itself. When |
matches a portion of a subject string that is identical to itself. When |
| 2928 |
caseless matching is specified (the PCRE_CASELESS option), letters are |
caseless matching is specified (the PCRE_CASELESS option), letters are |
| 2929 |
matched independently of case. In UTF-8 mode, PCRE always understands |
matched independently of case. In UTF-8 mode, PCRE always understands |
| 2930 |
the concept of case for characters whose values are less than 128, so |
the concept of case for characters whose values are less than 128, so |
| 2931 |
caseless matching is always possible. For characters with higher val- |
caseless matching is always possible. For characters with higher val- |
| 2932 |
ues, the concept of case is supported if PCRE is compiled with Unicode |
ues, the concept of case is supported if PCRE is compiled with Unicode |
| 2933 |
property support, but not otherwise. If you want to use caseless |
property support, but not otherwise. If you want to use caseless |
| 2934 |
matching for characters 128 and above, you must ensure that PCRE is |
matching for characters 128 and above, you must ensure that PCRE is |
| 2935 |
compiled with Unicode property support as well as with UTF-8 support. |
compiled with Unicode property support as well as with UTF-8 support. |
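The effect of caseless matching on the trivial example above can be tried out with a short sketch. It uses Python's re module and its re.IGNORECASE flag purely as a stand-in for PCRE_CASELESS; the variable names are illustrative, and nothing here exercises PCRE's own handling of characters above 127.

```python
import re

# re.IGNORECASE stands in for PCRE_CASELESS in this sketch (an assumption:
# this is Python's engine, not PCRE).
caseless = re.compile("The quick brown fox", re.IGNORECASE)

# With caseless matching, letters match independently of case.
ascii_hit = caseless.search("see THE QUICK BROWN FOX run") is not None

# Without it, the pattern only matches a portion identical to itself.
exact_only = re.search("The quick brown fox", "the quick brown fox") is None
```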
| 2936 |
|
|
| 2937 |
The power of regular expressions comes from the ability to include |
The power of regular expressions comes from the ability to include |
| 2938 |
alternatives and repetitions in the pattern. These are encoded in the |
alternatives and repetitions in the pattern. These are encoded in the |
| 2939 |
pattern by the use of metacharacters, which do not stand for themselves |
pattern by the use of metacharacters, which do not stand for themselves |
| 2940 |
but instead are interpreted in some special way. |
but instead are interpreted in some special way. |
| 2941 |
|
|
| 2942 |
There are two different sets of metacharacters: those that are recog- |
There are two different sets of metacharacters: those that are recog- |
| 2943 |
nized anywhere in the pattern except within square brackets, and those |
nized anywhere in the pattern except within square brackets, and those |
| 2944 |
that are recognized within square brackets. Outside square brackets, |
that are recognized within square brackets. Outside square brackets, |
| 2945 |
the metacharacters are as follows: |
the metacharacters are as follows: |
| 2946 |
|
|
| 2947 |
\ general escape character with several uses |
\ general escape character with several uses |
| 2960 |
also "possessive quantifier" |
also "possessive quantifier" |
| 2961 |
{ start min/max quantifier |
{ start min/max quantifier |
| 2962 |
|
|
| 2963 |
Part of a pattern that is in square brackets is called a "character |
Part of a pattern that is in square brackets is called a "character |
| 2964 |
class". In a character class the only metacharacters are: |
class". In a character class the only metacharacters are: |
| 2965 |
|
|
| 2966 |
\ general escape character |
\ general escape character |
| 2970 |
syntax) |
syntax) |
| 2971 |
] terminates the character class |
] terminates the character class |
| 2972 |
|
|
| 2973 |
The following sections describe the use of each of the metacharacters. |
The following sections describe the use of each of the metacharacters. |
| 2974 |
|
|
| 2975 |
|
|
| 2976 |
BACKSLASH |
BACKSLASH |
| 2977 |
|
|
| 2978 |
The backslash character has several uses. Firstly, if it is followed by |
The backslash character has several uses. Firstly, if it is followed by |
| 2979 |
a non-alphanumeric character, it takes away any special meaning that |
a non-alphanumeric character, it takes away any special meaning that |
| 2980 |
character may have. This use of backslash as an escape character |
character may have. This use of backslash as an escape character |
| 2981 |
applies both inside and outside character classes. |
applies both inside and outside character classes. |
| 2982 |
|
|
| 2983 |
For example, if you want to match a * character, you write \* in the |
For example, if you want to match a * character, you write \* in the |
| 2984 |
pattern. This escaping action applies whether or not the following |
pattern. This escaping action applies whether or not the following |
| 2985 |
character would otherwise be interpreted as a metacharacter, so it is |
character would otherwise be interpreted as a metacharacter, so it is |
| 2986 |
always safe to precede a non-alphanumeric with backslash to specify |
always safe to precede a non-alphanumeric with backslash to specify |
| 2987 |
that it stands for itself. In particular, if you want to match a back- |
that it stands for itself. In particular, if you want to match a back- |
| 2988 |
slash, you write \\. |
slash, you write \\. |
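As a small illustration of escaping non-alphanumeric characters, the following sketch uses Python's re module, which follows the same convention for these three cases. It is not PCRE itself, and the variable names are made up for the example.

```python
import re

star = re.compile(r"\*")       # a literal asterisk, not a quantifier
backslash = re.compile(r"\\")  # a literal backslash
percent = re.compile(r"\%")    # % is not a metacharacter; escaping it is harmless

results = (
    star.fullmatch("*") is not None,
    backslash.fullmatch("\\") is not None,
    percent.fullmatch("%") is not None,
)
```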
| 2989 |
|
|
| 2990 |
If a pattern is compiled with the PCRE_EXTENDED option, whitespace in |
If a pattern is compiled with the PCRE_EXTENDED option, whitespace in |
| 2991 |
the pattern (other than in a character class) and characters between a |
the pattern (other than in a character class) and characters between a |
| 2992 |
# outside a character class and the next newline are ignored. An escap- |
# outside a character class and the next newline are ignored. An escap- |
| 2993 |
ing backslash can be used to include a whitespace or # character as |
ing backslash can be used to include a whitespace or # character as |
| 2994 |
part of the pattern. |
part of the pattern. |
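The behaviour described for PCRE_EXTENDED has a close analogue in Python's re.VERBOSE flag, which is used below as an approximation only. The pattern is a made-up example showing ignored whitespace, a # comment, and the escaped forms that keep a space or # in the pattern.

```python
import re

# re.VERBOSE stands in for PCRE_EXTENDED here (an assumption: Python's
# engine, not PCRE).
pattern = re.compile(r"""
    \d+      # one or more digits
    \        # an escaped space is part of the pattern
    \#       # an escaped hash is a literal hash character
    [ #]     # inside a class, space and hash are ordinary characters
""", re.VERBOSE)

ok_hash = pattern.fullmatch("42 ##") is not None
ok_space = pattern.fullmatch("42 # ") is not None
missing_space = pattern.fullmatch("42##") is None
```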
| 2995 |
|
|
| 2996 |
If you want to remove the special meaning from a sequence of charac- |
If you want to remove the special meaning from a sequence of charac- |
| 2997 |
ters, you can do so by putting them between \Q and \E. This is differ- |
ters, you can do so by putting them between \Q and \E. This is differ- |
| 2998 |
ent from Perl in that $ and @ are handled as literals in \Q...\E |
ent from Perl in that $ and @ are handled as literals in \Q...\E |
| 2999 |
sequences in PCRE, whereas in Perl, $ and @ cause variable interpola- |
sequences in PCRE, whereas in Perl, $ and @ cause variable interpola- |
| 3000 |
tion. Note the following examples: |
tion. Note the following examples: |
| 3001 |
|
|
| 3002 |
Pattern PCRE matches Perl matches |
Pattern PCRE matches Perl matches |
| 3006 |
\Qabc\$xyz\E abc\$xyz abc\$xyz |
\Qabc\$xyz\E abc\$xyz abc\$xyz |
| 3007 |
\Qabc\E\$\Qxyz\E abc$xyz abc$xyz |
\Qabc\E\$\Qxyz\E abc$xyz abc$xyz |
| 3008 |
|
|
| 3009 |
The \Q...\E sequence is recognized both inside and outside character |
The \Q...\E sequence is recognized both inside and outside character |
| 3010 |
classes. |
classes. |
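The PCRE reading of \Q...\E, in which $ and @ stay literal, can be sketched by expanding each quoted run into individually escaped characters. Python's re module has no \Q...\E, so the helper below (expand_quoted, a hypothetical name) rewrites the pattern first; treating an unterminated \Q as running to the end of the pattern is an assumption of the sketch.

```python
import re

def expand_quoted(pattern: str) -> str:
    """Rewrite \\Q...\\E runs as escaped literals (illustrative helper)."""
    out = []
    i = 0
    while i < len(pattern):
        if pattern.startswith(r"\Q", i):
            end = pattern.find(r"\E", i + 2)
            if end == -1:
                end = len(pattern)  # assumption: unterminated \Q runs to the end
            out.append(re.escape(pattern[i + 2:end]))
            i = end + 2
        elif pattern[i] == "\\" and i + 1 < len(pattern):
            out.append(pattern[i:i + 2])  # keep other escapes untouched
            i += 2
        else:
            out.append(pattern[i])
            i += 1
    return "".join(out)

# The two table rows shown above, checked against their "PCRE matches" column.
row_a = re.fullmatch(expand_quoted(r"\Qabc\$xyz\E"), r"abc\$xyz") is not None
row_b = re.fullmatch(expand_quoted(r"\Qabc\E\$\Qxyz\E"), "abc$xyz") is not None
```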
| 3011 |
|
|
| 3012 |
Non-printing characters |
Non-printing characters |
| 3013 |
|
|
| 3014 |
A second use of backslash provides a way of encoding non-printing char- |
A second use of backslash provides a way of encoding non-printing char- |
| 3015 |
acters in patterns in a visible manner. There is no restriction on the |
acters in patterns in a visible manner. There is no restriction on the |
| 3016 |
appearance of non-printing characters, apart from the binary zero that |
appearance of non-printing characters, apart from the binary zero that |
| 3017 |
terminates a pattern, but when a pattern is being prepared by text |
terminates a pattern, but when a pattern is being prepared by text |
| 3018 |
editing, it is usually easier to use one of the following escape |
editing, it is usually easier to use one of the following escape |
| 3019 |
sequences than the binary character it represents: |
sequences than the binary character it represents: |
| 3020 |
|
|
| 3021 |
\a alarm, that is, the BEL character (hex 07) |
\a alarm, that is, the BEL character (hex 07) |
| 3029 |
\xhh character with hex code hh |
\xhh character with hex code hh |
| 3030 |
\x{hhh..} character with hex code hhh.. |
\x{hhh..} character with hex code hhh.. |
| 3031 |
|
|
| 3032 |
The precise effect of \cx is as follows: if x is a lower case letter, |
The precise effect of \cx is as follows: if x is a lower case letter, |
| 3033 |
it is converted to upper case. Then bit 6 of the character (hex 40) is |
it is converted to upper case. Then bit 6 of the character (hex 40) is |
| 3034 |
inverted. Thus \cz becomes hex 1A, but \c{ becomes hex 3B, while \c; |
inverted. Thus \cz becomes hex 1A, but \c{ becomes hex 3B, while \c; |
| 3035 |
becomes hex 7B. |
becomes hex 7B. |
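The \cx rule is plain arithmetic and can be checked directly. The function name control_escape is illustrative; the sketch only reproduces the two steps described above (fold lower case to upper case, then invert bit 6).

```python
def control_escape(x: str) -> int:
    """Return the character code that \\cx stands for, per the rule above."""
    if "a" <= x <= "z":
        x = x.upper()      # lower case letters are converted to upper case
    return ord(x) ^ 0x40   # invert bit 6 (hex 40)

codes = {ch: hex(control_escape(ch)) for ch in "z{;"}
```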
| 3036 |
|
|
| 3037 |
After \x, from zero to two hexadecimal digits are read (letters can be |
After \x, from zero to two hexadecimal digits are read (letters can be |
| 3038 |
in upper or lower case). Any number of hexadecimal digits may appear |
in upper or lower case). Any number of hexadecimal digits may appear |
| 3039 |
between \x{ and }, but the value of the character code must be less |
between \x{ and }, but the value of the character code must be less |
| 3040 |
than 256 in non-UTF-8 mode, and less than 2**31 in UTF-8 mode. That is, |
than 256 in non-UTF-8 mode, and less than 2**31 in UTF-8 mode. That is, |
| 3041 |
the maximum value in hexadecimal is 7FFFFFFF. Note that this is bigger |
the maximum value in hexadecimal is 7FFFFFFF. Note that this is bigger |
| 3042 |
than the largest Unicode code point, which is 10FFFF. |
than the largest Unicode code point, which is 10FFFF. |
| 3043 |
|
|
| 3044 |
If characters other than hexadecimal digits appear between \x{ and }, |
If characters other than hexadecimal digits appear between \x{ and }, |
| 3045 |
or if there is no terminating }, this form of escape is not recognized. |
or if there is no terminating }, this form of escape is not recognized. |
| 3046 |
Instead, the initial \x will be interpreted as a basic hexadecimal |
Instead, the initial \x will be interpreted as a basic hexadecimal |
| 3047 |
escape, with no following digits, giving a character whose value is |
escape, with no following digits, giving a character whose value is |
| 3048 |
zero. |
zero. |
| 3049 |
|
|
| 3050 |
Characters whose value is less than 256 can be defined by either of the |
Characters whose value is less than 256 can be defined by either of the |
| 3051 |
two syntaxes for \x. There is no difference in the way they are han- |
two syntaxes for \x. There is no difference in the way they are han- |
| 3052 |
dled. For example, \xdc is exactly the same as \x{dc}. |
dled. For example, \xdc is exactly the same as \x{dc}. |
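The rules for the two \x syntaxes can be summarized as a small parser. This is a sketch of the description above, not PCRE's source: parse_hex_escape is a hypothetical helper, it receives the text that follows \x, and its treatment of empty braces as zero is an assumption.

```python
import string

HEX_DIGITS = set(string.hexdigits)

def parse_hex_escape(text: str, utf8: bool = False):
    """Return (character code, characters consumed after the \\x)."""
    if text.startswith("{"):
        close = text.find("}")
        digits = text[1:close] if close != -1 else None
        if digits is not None and all(ch in HEX_DIGITS for ch in digits):
            value = int(digits, 16) if digits else 0  # assumption: {} means zero
            limit = 2**31 if utf8 else 256
            if value >= limit:
                raise ValueError("character value too large")
            return value, close + 1
        # Not a valid \x{...}: fall back to the basic form below.
    n = 0
    while n < 2 and n < len(text) and text[n] in HEX_DIGITS:
        n += 1
    # With no digits at all, the result is a character whose value is zero.
    return (int(text[:n], 16) if n else 0), n
```

With this sketch, parse_hex_escape("dc") and parse_hex_escape("{dc}") yield the same character code, and a malformed form such as "{zz}" yields code zero with nothing consumed.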
| 3053 |
|
|
| 3054 |
After \0 up to two further octal digits are read. If there are fewer |
After \0 up to two further octal digits are read. If there are fewer |
| 3055 |
than two digits, just those that are present are used. Thus the |
than two digits, just those that are present are used. Thus the |
| 3056 |
sequence \0\x\07 specifies two binary zeros followed by a BEL character |
sequence \0\x\07 specifies two binary zeros followed by a BEL character |
| 3057 |
(code value 7). Make sure you supply two digits after the initial zero |
(code value 7). Make sure you supply two digits after the initial zero |
| 3058 |
if the pattern character that follows is itself an octal digit. |
if the pattern character that follows is itself an octal digit. |
| 3059 |
|
|
| 3060 |
The handling of a backslash followed by a digit other than 0 is compli- |
The handling of a backslash followed by a digit other than 0 is compli- |
| 3061 |
cated. Outside a character class, PCRE reads it and any following dig- |
cated. Outside a character class, PCRE reads it and any following dig- |
| 3062 |
its as a decimal number. If the number is less than 10, or if there |
its as a decimal number. If the number is less than 10, or if there |
| 3063 |
have been at least that many previous capturing left parentheses in the |
have been at least that many previous capturing left parentheses in the |
| 3064 |
expression, the entire sequence is taken as a back reference. A |
expression, the entire sequence is taken as a back reference. A |
| 3065 |
description of how this works is given later, following the discussion |
description of how this works is given later, following the discussion |
| 3066 |
of parenthesized subpatterns. |
of parenthesized subpatterns. |
| 3067 |
|
|
| 3068 |
Inside a character class, or if the decimal number is greater than 9 |
Inside a character class, or if the decimal number is greater than 9 |
| 3069 |
and there have not been that many capturing subpatterns, PCRE re-reads |
and there have not been that many capturing subpatterns, PCRE re-reads |
| 3070 |
up to three octal digits following the backslash, and uses them to gen- |
up to three octal digits following the backslash, and uses them to gen- |
| 3071 |
erate a data character. Any subsequent digits stand for themselves. In |
erate a data character. Any subsequent digits stand for themselves. In |
| 3072 |
non-UTF-8 mode, the value of a character specified in octal must be |
non-UTF-8 mode, the value of a character specified in octal must be |
| 3073 |
less than \400. In UTF-8 mode, values up to \777 are permitted. For |
less than \400. In UTF-8 mode, values up to \777 are permitted. For |
| 3074 |
example: |
example: |
| 3075 |
|
|
| 3076 |
\040 is another way of writing a space |
\040 is another way of writing a space |
| 3088 |
\81 is either a back reference, or a binary zero |
\81 is either a back reference, or a binary zero |
| 3089 |
followed by the two characters "8" and "1" |
followed by the two characters "8" and "1" |
| 3090 |
|
|
| 3091 |
Note that octal values of 100 or greater must not be introduced by a |
Note that octal values of 100 or greater must not be introduced by a |
| 3092 |
leading zero, because no more than three octal digits are ever read. |
leading zero, because no more than three octal digits are ever read. |
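Because the handling of a backslash followed by digits depends on context, it may help to see the decision written out. The sketch below follows the prose above only; interpret_digit_escape and read_octal are hypothetical names, the input is the text after the backslash, and the limits on octal values (\400 and \777) are not enforced.

```python
OCTAL = "01234567"
DECIMAL = "0123456789"

def read_octal(text: str, max_digits: int):
    """Read up to max_digits octal digits; return (value, digits used)."""
    n = 0
    while n < max_digits and n < len(text) and text[n] in OCTAL:
        n += 1
    return (int(text[:n], 8) if n else 0), n

def interpret_digit_escape(text: str, groups_so_far: int, in_class: bool = False):
    """Classify a backslash-digit sequence as ("backref" | "char", value, used)."""
    if text[0] == "0":
        # After \0, up to two further octal digits are read.
        value, used = read_octal(text[1:], 2)
        return ("char", value, used + 1)
    if not in_class:
        n = 0
        while n < len(text) and text[n] in DECIMAL:
            n += 1
        number = int(text[:n])
        if number < 10 or number <= groups_so_far:
            return ("backref", number, n)
    # Otherwise re-read up to three octal digits as a data character;
    # any digits after those stand for themselves.
    value, used = read_octal(text, 3)
    return ("char", value, used)
```

For instance, interpret_digit_escape("81", 0) reports a character with value zero and no digits consumed, so the "8" and "1" remain as literal characters, in line with the \81 entry above.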
| 3093 |
|
|
| 3094 |
All the sequences that define a single character value can be used both |
All the sequences that define a single character value can be used both |
| 3095 |
inside and outside character classes. In addition, inside a character |
inside and outside character classes. In addition, inside a character |
| 3096 |
class, the sequence \b is interpreted as the backspace character (hex |
class, the sequence \b is interpreted as the backspace character (hex |
| 3097 |
08), and the sequences \R and \X are interpreted as the characters "R" |
08), and the sequences \R and \X are interpreted as the characters "R" |
| 3098 |
and "X", respectively. Outside a character class, these sequences have |
and "X", respectively. Outside a character class, these sequences have |
| 3099 |
different meanings (see below). |
different meanings (see below). |
| 3100 |
|
|
| 3101 |
Absolute and relative back references |
Absolute and relative back references |
| 3102 |
|
|
| 3103 |
The sequence \g followed by an unsigned or a negative number, option- |
The sequence \g followed by an unsigned or a negative number, option- |
| 3104 |
ally enclosed in braces, is an absolute or relative back reference. A |
ally enclosed in braces, is an absolute or relative back reference. A |
| 3105 |
named back reference can be coded as \g{name}. Back references are dis- |
named back reference can be coded as \g{name}. Back references are dis- |
| 3106 |
cussed later, following the discussion of parenthesized subpatterns. |
cussed later, following the discussion of parenthesized subpatterns. |
| 3107 |
|
|
| 3122 |
\W any "non-word" character |
\W any "non-word" character |
| 3123 |
|
|
| 3124 |
Each pair of escape sequences partitions the complete set of characters |
Each pair of escape sequences partitions the complete set of characters |
| 3125 |
into two disjoint sets. Any given character matches one, and only one, |
into two disjoint sets. Any given character matches one, and only one, |
| 3126 |
of each pair. |
of each pair. |
| 3127 |
|
|
| 3128 |
These character type sequences can appear both inside and outside char- |
These character type sequences can appear both inside and outside char- |
| 3129 |
acter classes. They each match one character of the appropriate type. |
acter classes. They each match one character of the appropriate type. |
| 3130 |
If the current matching point is at the end of the subject string, all |
If the current matching point is at the end of the subject string, all |
| 3131 |
of them fail, since there is no character to match. |
of them fail, since there is no character to match. |
| 3132 |
|
|
| 3133 |
For compatibility with Perl, \s does not match the VT character (code |
For compatibility with Perl, \s does not match the VT character (code |
| 3134 |
11). This makes it different from the POSIX "space" class. The \s |
11). This makes it different from the POSIX "space" class. The \s |
| 3135 |
characters are HT (9), LF (10), FF (12), CR (13), and space (32). If |
characters are HT (9), LF (10), FF (12), CR (13), and space (32). If |
| 3136 |
"use locale;" is included in a Perl script, \s may match the VT charac- |
"use locale;" is included in a Perl script, \s may match the VT charac- |
| 3137 |
ter. In PCRE, it never does. |
ter. In PCRE, it never does. |
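The set of \s characters listed above can be written as an explicit class, which is handy when porting a pattern to an engine whose \s differs. The sketch uses Python's re module, whose own \s does match VT; PCRE_SPACE is an illustrative name.

```python
import re

# HT (9), LF (10), FF (12), CR (13) and space (32): the \s set described
# above, with VT (11) deliberately left out.
PCRE_SPACE = re.compile(r"[\t\n\f\r ]")

candidates = "\t\n\x0b\f\r "
matches_s = [ch for ch in candidates if PCRE_SPACE.fullmatch(ch)]

# Python's built-in \s includes VT, unlike the set above.
python_s_matches_vt = re.fullmatch(r"\s", "\x0b") is not None
```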
| 3138 |
|
|
| 3139 |
In UTF-8 mode, characters with values greater than 128 never match \d, |
In UTF-8 mode, characters with values greater than 128 never match \d, |
| 3140 |
\s, or \w, and always match \D, \S, and \W. This is true even when Uni- |
\s, or \w, and always match \D, \S, and \W. This is true even when Uni- |
| 3141 |
code character property support is available. These sequences retain |
code character property support is available. These sequences retain |
| 3142 |
their original meanings from before UTF-8 support was available, mainly |
their original meanings from before UTF-8 support was available, mainly |
| 3143 |
for efficiency reasons. |
for efficiency reasons. |
| 3144 |
|
|
| 3145 |
The sequences \h, \H, \v, and \V are Perl 5.10 features. In contrast to |
The sequences \h, \H, \v, and \V are Perl 5.10 features. In contrast to |
| 3146 |
the other sequences, these do match certain high-valued codepoints in |
the other sequences, these do match certain high-valued codepoints in |
| 3147 |
UTF-8 mode. The horizontal space characters are: |
UTF-8 mode. The horizontal space characters are: |
| 3148 |
|
|
| 3149 |
U+0009 Horizontal tab |
U+0009 Horizontal tab |
| 3177 |
U+2029 Paragraph separator |
U+2029 Paragraph separator |
| 3178 |
|
|
| 3179 |
A "word" character is an underscore or any character less than 256 that |
A "word" character is an underscore or any character less than 256 that |
| 3180 |
is a letter or digit. The definition of letters and digits is con- |
is a letter or digit. The definition of letters and digits is con- |
| 3181 |
trolled by PCRE's low-valued character tables, and may vary if locale- |
trolled by PCRE's low-valued character tables, and may vary if locale- |
| 3182 |
specific matching is taking place (see "Locale support" in the pcreapi |
specific matching is taking place (see "Locale support" in the pcreapi |
| 3183 |
page). For example, in a French locale such as "fr_FR" in Unix-like |
page). For example, in a French locale such as "fr_FR" in Unix-like |
| 3184 |
systems, or "french" in Windows, some character codes greater than 128 |
systems, or "french" in Windows, some character codes greater than 128 |
| 3185 |
are used for accented letters, and these are matched by \w. The use of |
are used for accented letters, and these are matched by \w. The use of |
| 3186 |
locales with Unicode is discouraged. |
locales with Unicode is discouraged. |
| 3187 |
|
|
| 3188 |
Newline sequences |
Newline sequences |
| 3189 |
|
|
| 3190 |
Outside a character class, by default, the escape sequence \R matches |
Outside a character class, by default, the escape sequence \R matches |
| 3191 |
any Unicode newline sequence. This is a Perl 5.10 feature. In non-UTF-8 |
any Unicode newline sequence. This is a Perl 5.10 feature. In non-UTF-8 |
| 3192 |
mode \R is equivalent to the following: |
mode \R is equivalent to the following: |
| 3193 |
|
|
| 3194 |
(?>\r\n|\n|\x0b|\f|\r|\x85) |
(?>\r\n|\n|\x0b|\f|\r|\x85) |
| 3195 |
|
|
| 3196 |
This is an example of an "atomic group", details of which are given |
This is an example of an "atomic group", details of which are given |
| 3197 |
below. This particular group matches either the two-character sequence |
below. This particular group matches either the two-character sequence |
| 3198 |
CR followed by LF, or one of the single characters LF (linefeed, |
CR followed by LF, or one of the single characters LF (linefeed, |
| 3199 |
U+000A), VT (vertical tab, U+000B), FF (formfeed, U+000C), CR (carriage |
U+000A), VT (vertical tab, U+000B), FF (formfeed, U+000C), CR (carriage |
| 3200 |
return, U+000D), or NEL (next line, U+0085). The two-character sequence |
return, U+000D), or NEL (next line, U+0085). The two-character sequence |
| 3201 |
is treated as a single unit that cannot be split. |
is treated as a single unit that cannot be split. |
| 3202 |
|
|
| 3203 |
In UTF-8 mode, two additional characters whose codepoints are greater |
In UTF-8 mode, two additional characters whose codepoints are greater |
| 3204 |
than 255 are added: LS (line separator, U+2028) and PS (paragraph sepa- |
than 255 are added: LS (line separator, U+2028) and PS (paragraph sepa- |
| 3205 |
rator, U+2029). Unicode character property support is not needed for |
rator, U+2029). Unicode character property support is not needed for |
| 3206 |
these characters to be recognized. |
these characters to be recognized. |
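The equivalence given for \R can be tried in an engine that lacks \R by spelling out the alternation. The sketch below uses Python's re module with a non-capturing group rather than an atomic group, so it is only an approximation: used on its own, the alternation still takes CRLF as one unit because that alternative comes first, but it does not prevent backtracking into the group when more pattern follows. The names BSR_8BIT and BSR_UTF8 are illustrative.

```python
import re

# The non-UTF-8 set: CRLF, LF, VT, FF, CR, NEL.
BSR_8BIT = re.compile(r"(?:\r\n|\n|\x0b|\f|\r|\x85)")

# The UTF-8 mode set adds LS (U+2028) and PS (U+2029).
BSR_UTF8 = re.compile(r"(?:\r\n|\n|\x0b|\f|\r|\x85|\u2028|\u2029)")

text = "one\r\ntwo\nthree\x0bfour\x85five\u2028six"
lines_8bit = BSR_8BIT.split(text)
lines_utf8 = BSR_UTF8.split(text)
```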
| 3207 |
|
|
| 3208 |
It is possible to restrict \R to match only CR, LF, or CRLF (instead of |
It is possible to restrict \R to match only CR, LF, or CRLF (instead of |
| 3209 |
the complete set of Unicode line endings) by setting the option |
the complete set of Unicode line endings) by setting the option |
| 3210 |
PCRE_BSR_ANYCRLF either at compile time or when the pattern is matched. |
PCRE_BSR_ANYCRLF either at compile time or when the pattern is matched. |
| 3211 |
This can be made the default when PCRE is built; if this is the case, |
(BSR is an abbreviation for "backslash R".) This can be made the default |
| 3212 |
the other behaviour can be requested via the PCRE_BSR_UNICODE option. |
when PCRE is built; if this is the case, the other behaviour can be |
| 3213 |
It is also possible to specify these settings by starting a pattern |
requested via the PCRE_BSR_UNICODE option. It is also possible to |
| 3214 |
string with one of the following sequences: |
specify these settings by starting a pattern string with one of the |
| 3215 |
|
following sequences: |
| 3216 |
|
|
| 3217 |
(*BSR_ANYCRLF) CR, LF, or CRLF only |
(*BSR_ANYCRLF) CR, LF, or CRLF only |
| 3218 |
(*BSR_UNICODE) any Unicode newline sequence |
(*BSR_UNICODE) any Unicode newline sequence |
| 3221 |
they can be overridden by options given to pcre_exec(). Note that these |
they can be overridden by options given to pcre_exec(). Note that these |
| 3222 |
special settings, which are not Perl-compatible, are recognized only at |
special settings, which are not Perl-compatible, are recognized only at |
| 3223 |
the very start of a pattern, and that they must be in upper case. If |
the very start of a pattern, and that they must be in upper case. If |
| 3224 |
more than one of them is present, the last one is used. |
more than one of them is present, the last one is used. They can be |
| 3225 |
|
combined with a change of newline convention, for example, a pattern |
| 3226 |
|
can start with: |
| 3227 |
|
|
| 3228 |
|
(*ANY)(*BSR_ANYCRLF) |
| 3229 |
|
|
| 3230 |
Inside a character class, \R matches the letter "R". |
Inside a character class, \R matches the letter "R". |
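The rules for the leading (*BSR_...) settings (recognized only at the very start, upper case only, last one wins, combinable with a newline convention) can be sketched as a small scanner. This is not how PCRE is implemented: leading_settings is a hypothetical helper, the newline names other than (*ANY) are assumed, and applying "last one is used" to newline conventions as well is also an assumption.

```python
import re

# Settings recognized at the start of a pattern in this sketch. The
# alternatives are case-sensitive, so lower case spellings are not accepted.
LEADING = re.compile(r"\(\*(BSR_ANYCRLF|BSR_UNICODE|CR|LF|CRLF|ANYCRLF|ANY)\)")

def leading_settings(pattern: str):
    """Return (bsr setting, newline setting, rest of the pattern)."""
    bsr = None
    newline = None
    pos = 0
    while True:
        m = LEADING.match(pattern, pos)
        if m is None:
            break
        name = m.group(1)
        if name.startswith("BSR_"):
            bsr = name        # a later setting replaces an earlier one
        else:
            newline = name
        pos = m.end()
    return bsr, newline, pattern[pos:]

example = leading_settings(r"(*ANY)(*BSR_ANYCRLF)a\Rb")
```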
| 3231 |
|
|
| 4856 |
|
|
| 4857 |
REVISION |
REVISION |
| 4858 |
|
|
| 4859 |
Last updated: 11 September 2007 |
Last updated: 14 September 2007 |
| 4860 |
Copyright (c) 1997-2007 University of Cambridge. |
Copyright (c) 1997-2007 University of Cambridge. |
| 4861 |
------------------------------------------------------------------------------ |
------------------------------------------------------------------------------ |
| 4862 |
|
|