| 18 |
<li><a name="TOC3" href="#SEC3">CHARACTERS AND METACHARACTERS</a> |
<li><a name="TOC3" href="#SEC3">CHARACTERS AND METACHARACTERS</a> |
| 19 |
<li><a name="TOC4" href="#SEC4">BACKSLASH</a> |
<li><a name="TOC4" href="#SEC4">BACKSLASH</a> |
| 20 |
<li><a name="TOC5" href="#SEC5">CIRCUMFLEX AND DOLLAR</a> |
<li><a name="TOC5" href="#SEC5">CIRCUMFLEX AND DOLLAR</a> |
| 21 |
<li><a name="TOC6" href="#SEC6">FULL STOP (PERIOD, DOT)</a> |
<li><a name="TOC6" href="#SEC6">FULL STOP (PERIOD, DOT) AND \N</a> |
| 22 |
<li><a name="TOC7" href="#SEC7">MATCHING A SINGLE BYTE</a> |
<li><a name="TOC7" href="#SEC7">MATCHING A SINGLE BYTE</a> |
| 23 |
<li><a name="TOC8" href="#SEC8">SQUARE BRACKETS AND CHARACTER CLASSES</a> |
<li><a name="TOC8" href="#SEC8">SQUARE BRACKETS AND CHARACTER CLASSES</a> |
| 24 |
<li><a name="TOC9" href="#SEC9">POSIX CHARACTER CLASSES</a> |
<li><a name="TOC9" href="#SEC9">POSIX CHARACTER CLASSES</a> |
| 124 |
is used. |
is used. |
| 125 |
</P> |
</P> |
| 126 |
<P> |
<P> |
| 127 |
The newline convention does not affect what the \R escape sequence matches. By |
The newline convention affects the interpretation of the dot metacharacter when |
| 128 |
default, this is any Unicode newline sequence, for Perl compatibility. However, |
PCRE_DOTALL is not set, and also the behaviour of \N. However, it does not |
| 129 |
this can be changed; see the description of \R in the section entitled |
affect what the \R escape sequence matches. By default, this is any Unicode |
| 130 |
|
newline sequence, for Perl compatibility. However, this can be changed; see the |
| 131 |
|
description of \R in the section entitled |
| 132 |
<a href="#newlineseq">"Newline sequences"</a> |
<a href="#newlineseq">"Newline sequences"</a> |
| 133 |
below. A change of \R setting can be combined with a change of newline |
below. A change of \R setting can be combined with a change of newline |
| 134 |
convention. |
convention. |
| 310 |
<P> |
<P> |
| 311 |
All the sequences that define a single character value can be used both inside |
All the sequences that define a single character value can be used both inside |
| 312 |
and outside character classes. In addition, inside a character class, the |
and outside character classes. In addition, inside a character class, the |
| 313 |
sequence \b is interpreted as the backspace character (hex 08), and the |
sequence \b is interpreted as the backspace character (hex 08). The sequences |
| 314 |
sequences \R and \X are interpreted as the characters "R" and "X", |
\B, \N, \R, and \X are not special inside a character class. Like any other |
| 315 |
respectively. Outside a character class, these sequences have different |
unrecognized escape sequences, they are treated as the literal characters "B", |
| 316 |
meanings |
"N", "R", and "X" by default, but cause an error if the PCRE_EXTRA option is |
| 317 |
<a href="#uniextseq">(see below).</a> |
set. Outside a character class, these sequences have different meanings. |
| 318 |
</P> |
</P> |
| 319 |
<br><b> |
<br><b> |
| 320 |
Absolute and relative back references |
Absolute and relative back references |
| 339 |
synonymous. The former is a back reference; the latter is a |
synonymous. The former is a back reference; the latter is a |
| 340 |
<a href="#subpatternsassubroutines">subroutine</a> |
<a href="#subpatternsassubroutines">subroutine</a> |
| 341 |
call. |
call. |
| 342 |
</P> |
<a name="genericchartypes"></a></P> |
| 343 |
<br><b> |
<br><b> |
| 344 |
Generic character types |
Generic character types |
| 345 |
</b><br> |
</b><br> |
| 346 |
<P> |
<P> |
| 347 |
Another use of backslash is for specifying generic character types. The |
Another use of backslash is for specifying generic character types: |
|
following are always recognized: |
|
| 348 |
<pre> |
<pre> |
| 349 |
\d any decimal digit |
\d any decimal digit |
| 350 |
\D any character that is not a decimal digit |
\D any character that is not a decimal digit |
| 357 |
\w any "word" character |
\w any "word" character |
| 358 |
\W any "non-word" character |
\W any "non-word" character |
| 359 |
</pre> |
</pre> |
| 360 |
Each pair of escape sequences partitions the complete set of characters into |
There is also the single sequence \N, which matches a non-newline character. |
| 361 |
two disjoint sets. Any given character matches one, and only one, of each pair. |
This is the same as |
| 362 |
|
<a href="#fullstopdot">the "." metacharacter</a> |
| 363 |
|
when PCRE_DOTALL is not set. |
| 364 |
|
</P> |
| 365 |
|
<P> |
| 366 |
|
Each pair of lower and upper case escape sequences partitions the complete set |
| 367 |
|
of characters into two disjoint sets. Any given character matches one, and only |
| 368 |
|
one, of each pair. |
| 369 |
</P> |
</P> |
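<P>
The partition property described above can be checked empirically. The
following is a minimal sketch that uses Python's re module as a stand-in, not
PCRE itself; the helper name matches_exactly_one and the choice of the ASCII
range as a sample are assumptions made for illustration.
</P>

```python
import re

# Lower/upper case pairs of generic character type escapes.
PAIRS = [(r"\d", r"\D"), (r"\s", r"\S"), (r"\w", r"\W")]


def matches_exactly_one(ch, lower, upper):
    """Return True if ch matches exactly one of the two escapes."""
    in_lower = re.fullmatch(lower, ch) is not None
    in_upper = re.fullmatch(upper, ch) is not None
    return in_lower != in_upper


# Sample: every ASCII character (an arbitrary choice for this sketch).
sample = [chr(code) for code in range(128)]

all_partitioned = all(
    matches_exactly_one(ch, lower, upper)
    for ch in sample
    for lower, upper in PAIRS
)
```

<P>
If the pairs really are disjoint and exhaustive over the sample,
all_partitioned ends up true. The same kind of check could be written against
PCRE directly, given suitable bindings.
</P>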
| 370 |
<P> |
<P> |
| 371 |
These character type sequences can appear both inside and outside character |
These character type sequences can appear both inside and outside character |
| 483 |
<pre> |
<pre> |
| 484 |
(*ANY)(*BSR_ANYCRLF) |
(*ANY)(*BSR_ANYCRLF) |
| 485 |
</pre> |
</pre> |
| 486 |
Inside a character class, \R matches the letter "R". |
Inside a character class, \R is treated as an unrecognized escape sequence, |
| 487 |
|
and so matches the letter "R" by default, but causes an error if PCRE_EXTRA is |
| 488 |
|
set. |
| 489 |
<a name="uniextseq"></a></P> |
<a name="uniextseq"></a></P> |
| 490 |
<br><b> |
<br><b> |
| 491 |
Unicode character properties |
Unicode character properties |
| 502 |
\X an extended Unicode sequence |
\X an extended Unicode sequence |
| 503 |
</pre> |
</pre> |
| 504 |
The property names represented by <i>xx</i> above are limited to the Unicode |
The property names represented by <i>xx</i> above are limited to the Unicode |
| 505 |
script names, the general category properties, and "Any", which matches any |
script names, the general category properties, "Any", which matches any |
| 506 |
character (including newline). Other properties such as "InMusicalSymbols" are |
character (including newline), and some special PCRE properties (described |
| 507 |
not currently supported by PCRE. Note that \P{Any} does not match any |
in the |
| 508 |
characters, so always causes a match failure. |
<a href="#extraprops">next section).</a> |
| 509 |
|
Other Perl properties such as "InMusicalSymbols" are not currently supported by |
| 510 |
|
PCRE. Note that \P{Any} does not match any characters, so always causes a |
| 511 |
|
match failure. |
| 512 |
</P> |
</P> |
| 513 |
<P> |
<P> |
| 514 |
Sets of Unicode characters are defined as belonging to certain scripts. A |
Sets of Unicode characters are defined as belonging to certain scripts. A |
| 616 |
Yi. |
Yi. |
| 617 |
</P> |
</P> |
| 618 |
<P> |
<P> |
| 619 |
Each character has exactly one general category property, specified by a |
Each character has exactly one Unicode general category property, specified by |
| 620 |
two-letter abbreviation. For compatibility with Perl, negation can be specified |
a two-letter abbreviation. For compatibility with Perl, negation can be |
| 621 |
by including a circumflex between the opening brace and the property name. For |
specified by including a circumflex between the opening brace and the property |
| 622 |
example, \p{^Lu} is the same as \P{Lu}. |
name. For example, \p{^Lu} is the same as \P{Lu}. |
| 623 |
</P> |
</P> |
| 624 |
<P> |
<P> |
| 625 |
If only one letter is specified with \p or \P, it includes all the general |
If only one letter is specified with \p or \P, it includes all the general |
| 721 |
a structure that contains data for over fifteen thousand characters. That is |
a structure that contains data for over fifteen thousand characters. That is |
| 722 |
why the traditional escape sequences such as \d and \w do not use Unicode |
why the traditional escape sequences such as \d and \w do not use Unicode |
| 723 |
properties in PCRE. |
properties in PCRE. |
| 724 |
|
<a name="extraprops"></a></P> |
| 725 |
|
<br><b> |
| 726 |
|
PCRE's additional properties |
| 727 |
|
</b><br> |
| 728 |
|
<P> |
| 729 |
|
As well as the standard Unicode properties described in the previous |
| 730 |
|
section, PCRE supports four more that make it possible to convert traditional |
| 731 |
|
escape sequences such as \w and \s and POSIX character classes to use Unicode |
| 732 |
|
properties. These are: |
| 733 |
|
<pre> |
| 734 |
|
Xan Any alphanumeric character |
| 735 |
|
Xps Any POSIX space character |
| 736 |
|
Xsp Any Perl space character |
| 737 |
|
Xwd Any Perl "word" character |
| 738 |
|
</pre> |
| 739 |
|
Xan matches characters that have either the L (letter) or the N (number) |
| 740 |
|
property. Xps matches the characters tab, linefeed, vertical tab, formfeed, or |
| 741 |
|
carriage return, and any other character that has the Z (separator) property. |
| 742 |
|
Xsp is the same as Xps, except that vertical tab is excluded. Xwd matches the |
| 743 |
|
same characters as Xan, plus underscore. |
| 744 |
<a name="resetmatchstart"></a></P> |
<a name="resetmatchstart"></a></P> |
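<P>
The definitions of the four additional properties can be restated as
predicates on Unicode general categories. This is only a sketch of the
definitions given above, written with Python's unicodedata module; it is not
how PCRE implements them, the function names (is_xan, is_xps, is_xsp, is_xwd)
are hypothetical, and the Unicode version behind unicodedata may differ from
the tables that PCRE uses.
</P>

```python
import unicodedata

# Characters that the Xps definition lists explicitly:
# tab, linefeed, vertical tab, formfeed, carriage return.
POSIX_SPACE_CONTROLS = {"\t", "\n", "\x0b", "\x0c", "\r"}


def is_xan(ch):
    """Xan: the character has the L (letter) or N (number) property."""
    return unicodedata.category(ch)[0] in ("L", "N")


def is_xps(ch):
    """Xps: a listed control character, or any Z (separator) character."""
    return ch in POSIX_SPACE_CONTROLS or unicodedata.category(ch)[0] == "Z"


def is_xsp(ch):
    """Xsp: the same as Xps, except that vertical tab is excluded."""
    return is_xps(ch) and ch != "\x0b"


def is_xwd(ch):
    """Xwd: the same characters as Xan, plus underscore."""
    return is_xan(ch) or ch == "_"
```

<P>
Under these definitions, vertical tab satisfies is_xps but not is_xsp, and
underscore satisfies is_xwd but not is_xan.
</P>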
| 745 |
<br><b> |
<br><b> |
| 746 |
Resetting the match start |
Resetting the match start |
| 789 |
\z matches only at the end of the subject |
\z matches only at the end of the subject |
| 790 |
\G matches at the first matching position in the subject |
\G matches at the first matching position in the subject |
| 791 |
</pre> |
</pre> |
| 792 |
These assertions may not appear in character classes (but note that \b has a |
Inside a character class, \b has a different meaning; it matches the backspace |
| 793 |
different meaning, namely the backspace character, inside a character class). |
character. If any other of these assertions appears in a character class, by |
| 794 |
|
default it matches the corresponding literal character (for example, \B |
| 795 |
|
matches the letter B). However, if the PCRE_EXTRA option is set, an "invalid |
| 796 |
|
escape sequence" error is generated instead. |
| 797 |
</P> |
</P> |
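<P>
The two meanings of \b can be seen side by side in the short sketch below. It
uses Python's re module as a stand-in, not PCRE; Python happens to treat \b in
the same two ways, but it always rejects \B inside a character class, which
resembles PCRE's behaviour with PCRE_EXTRA set rather than the default. The
variable names are illustrative only.
</P>

```python
import re

# Inside a character class, \b is the backspace character (hex 08).
backspace_in_class = re.fullmatch(r"[\b]", "\x08") is not None

# Outside a character class, \b is a zero-width word boundary assertion.
boundary_positions = [m.start() for m in re.finditer(r"\b", "ab cd")]

# Python's re module refuses \B inside a class instead of treating it as
# the literal letter "B" (this is Python's behaviour, not PCRE's default).
try:
    re.compile(r"[\B]")
    class_escape_rejected = False
except re.error:
    class_escape_rejected = True
```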
| 798 |
<P> |
<P> |
| 799 |
A word boundary is a position in the subject string where the current character |
A word boundary is a position in the subject string where the current character |
| 889 |
Note that the sequences \A, \Z, and \z can be used to match the start and |
Note that the sequences \A, \Z, and \z can be used to match the start and |
| 890 |
end of the subject in both modes, and if all branches of a pattern start with |
end of the subject in both modes, and if all branches of a pattern start with |
| 891 |
\A it is always anchored, whether or not PCRE_MULTILINE is set. |
\A it is always anchored, whether or not PCRE_MULTILINE is set. |
| 892 |
</P> |
<a name="fullstopdot"></a></P> |
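<P>
The difference between circumflex and \A in multiline mode can be illustrated
as follows. This sketch uses Python's re module, whose re.MULTILINE flag plays
the role of PCRE_MULTILINE here; it is a stand-in rather than PCRE itself, and
the subject string is an arbitrary example.
</P>

```python
import re

subject = "one\ntwo"

# In multiline mode, circumflex matches at the start of every line.
circumflex_hits = re.findall(r"^\w+", subject, re.MULTILINE)

# \A still matches only at the very start of the subject.
start_anchor_hits = re.findall(r"\A\w+", subject, re.MULTILINE)
```

<P>
With this subject, the circumflex pattern finds both words, whereas the \A
pattern finds only the first.
</P>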
| 893 |
<br><a name="SEC6" href="#TOC1">FULL STOP (PERIOD, DOT)</a><br> |
<br><a name="SEC6" href="#TOC1">FULL STOP (PERIOD, DOT) AND \N</a><br> |
| 894 |
<P> |
<P> |
| 895 |
Outside a character class, a dot in the pattern matches any one character in |
Outside a character class, a dot in the pattern matches any one character in |
| 896 |
the subject string except (by default) a character that signifies the end of a |
the subject string except (by default) a character that signifies the end of a |
| 915 |
dollar, the only relationship being that they both involve newlines. Dot has no |
dollar, the only relationship being that they both involve newlines. Dot has no |
| 916 |
special meaning in a character class. |
special meaning in a character class. |
| 917 |
</P> |
</P> |
| 918 |
|
<P> |
| 919 |
|
The escape sequence \N always behaves as a dot does when PCRE_DOTALL is not |
| 920 |
|
set. In other words, it matches any one character except one that signifies the |
| 921 |
|
end of a line. |
| 922 |
|
</P> |
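<P>
The relationship between dot, the dotall setting, and \N can be sketched as
follows. Python's re module is used as a stand-in, with re.DOTALL in place of
PCRE_DOTALL. Python's re does not implement \N as a non-newline escape, so the
class [^\n] is used here to emulate it; that emulation assumes linefeed is the
only line ending, whereas PCRE follows the newline convention in force.
</P>

```python
import re

# Without the dotall flag, dot does not match a newline.
dot_matches_newline_default = re.fullmatch(r".", "\n") is not None

# With the dotall flag, dot matches any character, newline included.
dot_matches_newline_dotall = re.fullmatch(r".", "\n", re.DOTALL) is not None

# Emulation of \N: a non-newline character, unaffected by the dotall flag.
non_newline = re.compile(r"[^\n]", re.DOTALL)
non_newline_matches_letter = non_newline.fullmatch("a") is not None
non_newline_matches_newline = non_newline.fullmatch("\n") is not None
```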
| 923 |
<br><a name="SEC7" href="#TOC1">MATCHING A SINGLE BYTE</a><br> |
<br><a name="SEC7" href="#TOC1">MATCHING A SINGLE BYTE</a><br> |
| 924 |
<P> |
<P> |
| 925 |
Outside a character class, the escape sequence \C matches any one byte, both |
Outside a character class, the escape sequence \C matches any one byte, both |
| 2589 |
</P> |
</P> |
| 2590 |
<br><a name="SEC28" href="#TOC1">REVISION</a><br> |
<br><a name="SEC28" href="#TOC1">REVISION</a><br> |
| 2591 |
<P> |
<P> |
| 2592 |
Last updated: 27 March 2010 |
Last updated: 05 May 2010 |
| 2593 |
<br> |
<br> |
| 2594 |
Copyright © 1997-2010 University of Cambridge. |
Copyright © 1997-2010 University of Cambridge. |
| 2595 |
<br> |
<br> |