| 175 |
\t tab (hex 09) |
\t tab (hex 09) |
| 176 |
\ddd character with octal code ddd, or backreference |
\ddd character with octal code ddd, or backreference |
| 177 |
\xhh character with hex code hh |
\xhh character with hex code hh |
| 178 |
\x{hhh..} character with hex code hhh... (UTF-8 mode only) |
\x{hhh..} character with hex code hhh.. |
| 179 |
</pre> |
</pre> |
| 180 |
The precise effect of \cx is as follows: if x is a lower case letter, it |
The precise effect of \cx is as follows: if x is a lower case letter, it |
| 181 |
is converted to upper case. Then bit 6 of the character (hex 40) is inverted. |
is converted to upper case. Then bit 6 of the character (hex 40) is inverted. |
| 184 |
</P> |
</P> |
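The \cx rule above is plain bit arithmetic, and the following Python sketch restates it. The function name control_escape is hypothetical, and the sketch assumes x is a single ASCII character; it is an illustration of the stated rule, not PCRE's own code.

```python
def control_escape(x: str) -> str:
    # Hypothetical helper: returns the character that the escape "c" + x
    # denotes under the rule described in the text.
    # Assumption: x is exactly one ASCII character.
    code = ord(x)
    if "a" <= x <= "z":
        # Lower case letters are converted to upper case first.
        code = ord(x.upper())
    # Then bit 6 (hex 40) is inverted.
    return chr(code ^ 0x40)
```

Under this rule a lower case z gives hex 1A, while an opening brace (hex 7B) gives hex 3B, because only letters are upper-cased before the bit is flipped.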
| 185 |
<P> |
<P> |
| 186 |
After \x, from zero to two hexadecimal digits are read (letters can be in |
After \x, from zero to two hexadecimal digits are read (letters can be in |
| 187 |
upper or lower case). In UTF-8 mode, any number of hexadecimal digits may |
upper or lower case). Any number of hexadecimal digits may appear between \x{ |
| 188 |
appear between \x{ and }, but the value of the character code must be less |
and }, but the value of the character code must be less than 256 in non-UTF-8 |
| 189 |
than 2**31 (that is, the maximum hexadecimal value is 7FFFFFFF). If characters |
mode, and less than 2**31 in UTF-8 mode (that is, the maximum hexadecimal value |
| 190 |
other than hexadecimal digits appear between \x{ and }, or if there is no |
is 7FFFFFFF). If characters other than hexadecimal digits appear between \x{ |
| 191 |
terminating }, this form of escape is not recognized. Instead, the initial |
and }, or if there is no terminating }, this form of escape is not recognized. |
| 192 |
\x will be interpreted as a basic hexadecimal escape, with no following |
Instead, the initial \x will be interpreted as a basic hexadecimal escape, |
| 193 |
digits, giving a character whose value is zero. |
with no following digits, giving a character whose value is zero. |
| 194 |
</P> |
</P> |
| 195 |
<P> |
<P> |
| 196 |
Characters whose value is less than 256 can be defined by either of the two |
Characters whose value is less than 256 can be defined by either of the two |
| 197 |
syntaxes for \x when PCRE is in UTF-8 mode. There is no difference in the |
syntaxes for \x. There is no difference in the way they are handled. For |
| 198 |
way they are handled. For example, \xdc is exactly the same as \x{dc}. |
example, \xdc is exactly the same as \x{dc}. |
| 199 |
</P> |
</P> |
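The reading rules for \x described above can be sketched as a small Python function. The name parse_x_escape, its return convention (character code, number of characters consumed after the \x), and the use of ValueError for an over-large value are assumptions made for illustration. The text does not say what happens when the braces are empty, so this sketch simply treats that case as unrecognized. It is not PCRE's implementation.

```python
from typing import Tuple

HEX_DIGITS = set("0123456789abcdefABCDEF")

def parse_x_escape(rest: str, utf8: bool = False) -> Tuple[int, int]:
    # Hypothetical helper. `rest` is the pattern text immediately after "\x".
    # Returns (character code, characters consumed from `rest`).
    if rest.startswith("{"):
        close = rest.find("}")
        body = rest[1:close] if close != -1 else ""
        # Braced form: only hexadecimal digits and a terminating brace.
        # Assumption: an empty body is treated as "not recognized".
        if close != -1 and body and all(ch in HEX_DIGITS for ch in body):
            value = int(body, 16)
            limit = 2**31 if utf8 else 256
            if value >= limit:
                # Assumption: an out-of-range value is reported as an error.
                raise ValueError("character code too large")
            return value, close + 1
        # Not recognized: a basic escape with no digits, value zero.
        return 0, 0
    # Basic form: from zero to two hexadecimal digits, either case.
    n = 0
    while n < 2 and n < len(rest) and rest[n] in HEX_DIGITS:
        n += 1
    return (int(rest[:n], 16) if n else 0), n
```

With this sketch, "dc" and "{dc}" both yield the code hex DC, which matches the statement that the two syntaxes are handled identically for values below 256.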
| 200 |
<P> |
<P> |
| 201 |
After \0 up to two further octal digits are read. In both cases, if there |
After \0 up to two further octal digits are read. In both cases, if there |
| 285 |
<P> |
<P> |
| 286 |
In UTF-8 mode, characters with values greater than 128 never match \d, \s, or |
In UTF-8 mode, characters with values greater than 128 never match \d, \s, or |
| 287 |
\w, and always match \D, \S, and \W. This is true even when Unicode |
\w, and always match \D, \S, and \W. This is true even when Unicode |
| 288 |
character property support is available. |
character property support is available. The use of locales with Unicode is |
| 289 |
|
discouraged. |
| 290 |
<a name="uniextseq"></a></P> |
<a name="uniextseq"></a></P> |
| 291 |
<br><b> |
<br><b> |
| 292 |
Unicode character properties |
Unicode character properties |
| 293 |
</b><br> |
</b><br> |
| 294 |
<P> |
<P> |
| 295 |
When PCRE is built with Unicode character property support, three additional |
When PCRE is built with Unicode character property support, three additional |
| 296 |
escape sequences to match generic character types are available when UTF-8 mode |
escape sequences to match character properties are available when UTF-8 mode |
| 297 |
is selected. They are: |
is selected. They are: |
| 298 |
<pre> |
<pre> |
| 299 |
\p{<i>xx</i>} a character with the <i>xx</i> property |
\p{<i>xx</i>} a character with the <i>xx</i> property |
| 300 |
\P{<i>xx</i>} a character without the <i>xx</i> property |
\P{<i>xx</i>} a character without the <i>xx</i> property |
| 301 |
\X an extended Unicode sequence |
\X an extended Unicode sequence |
| 302 |
</pre> |
</pre> |
| 303 |
The property names represented by <i>xx</i> above are limited to the |
The property names represented by <i>xx</i> above are limited to the Unicode |
| 304 |
Unicode general category properties. Each character has exactly one such |
script names, the general category properties, and "Any", which matches any |
| 305 |
property, specified by a two-letter abbreviation. For compatibility with Perl, |
character (including newline). Other properties such as "InMusicalSymbols" are |
| 306 |
negation can be specified by including a circumflex between the opening brace |
not currently supported by PCRE. Note that \P{Any} does not match any |
| 307 |
and the property name. For example, \p{^Lu} is the same as \P{Lu}. |
characters, so always causes a match failure. |
| 308 |
</P> |
</P> |
| 309 |
<P> |
<P> |
| 310 |
If only one letter is specified with \p or \P, it includes all the properties |
Sets of Unicode characters are defined as belonging to certain scripts. A |
| 311 |
that start with that letter. In this case, in the absence of negation, the |
character from one of these sets can be matched using a script name. For |
| 312 |
curly brackets in the escape sequence are optional; these two examples have |
example: |
| 313 |
the same effect: |
<pre> |
| 314 |
|
\p{Greek} |
| 315 |
|
\P{Han} |
| 316 |
|
</pre> |
| 317 |
|
Those that are not part of an identified script are lumped together as |
| 318 |
|
"Common". The current list of scripts is: |
| 319 |
|
</P> |
| 320 |
|
<P> |
| 321 |
|
Arabic, |
| 322 |
|
Armenian, |
| 323 |
|
Bengali, |
| 324 |
|
Bopomofo, |
| 325 |
|
Braille, |
| 326 |
|
Buginese, |
| 327 |
|
Buhid, |
| 328 |
|
Canadian_Aboriginal, |
| 329 |
|
Cherokee, |
| 330 |
|
Common, |
| 331 |
|
Coptic, |
| 332 |
|
Cypriot, |
| 333 |
|
Cyrillic, |
| 334 |
|
Deseret, |
| 335 |
|
Devanagari, |
| 336 |
|
Ethiopic, |
| 337 |
|
Georgian, |
| 338 |
|
Glagolitic, |
| 339 |
|
Gothic, |
| 340 |
|
Greek, |
| 341 |
|
Gujarati, |
| 342 |
|
Gurmukhi, |
| 343 |
|
Han, |
| 344 |
|
Hangul, |
| 345 |
|
Hanunoo, |
| 346 |
|
Hebrew, |
| 347 |
|
Hiragana, |
| 348 |
|
Inherited, |
| 349 |
|
Kannada, |
| 350 |
|
Katakana, |
| 351 |
|
Kharoshthi, |
| 352 |
|
Khmer, |
| 353 |
|
Lao, |
| 354 |
|
Latin, |
| 355 |
|
Limbu, |
| 356 |
|
Linear_B, |
| 357 |
|
Malayalam, |
| 358 |
|
Mongolian, |
| 359 |
|
Myanmar, |
| 360 |
|
New_Tai_Lue, |
| 361 |
|
Ogham, |
| 362 |
|
Old_Italic, |
| 363 |
|
Old_Persian, |
| 364 |
|
Oriya, |
| 365 |
|
Osmanya, |
| 366 |
|
Runic, |
| 367 |
|
Shavian, |
| 368 |
|
Sinhala, |
| 369 |
|
Syloti_Nagri, |
| 370 |
|
Syriac, |
| 371 |
|
Tagalog, |
| 372 |
|
Tagbanwa, |
| 373 |
|
Tai_Le, |
| 374 |
|
Tamil, |
| 375 |
|
Telugu, |
| 376 |
|
Thaana, |
| 377 |
|
Thai, |
| 378 |
|
Tibetan, |
| 379 |
|
Tifinagh, |
| 380 |
|
Ugaritic, |
| 381 |
|
Yi. |
| 382 |
|
</P> |
| 383 |
|
<P> |
| 384 |
|
Each character has exactly one general category property, specified by a |
| 385 |
|
two-letter abbreviation. For compatibility with Perl, negation can be specified |
| 386 |
|
by including a circumflex between the opening brace and the property name. For |
| 387 |
|
example, \p{^Lu} is the same as \P{Lu}. |
| 388 |
|
</P> |
| 389 |
|
<P> |
| 390 |
|
If only one letter is specified with \p or \P, it includes all the general |
| 391 |
|
category properties that start with that letter. In this case, in the absence |
| 392 |
|
of negation, the curly brackets in the escape sequence are optional; these two |
| 393 |
|
examples have the same effect: |
| 394 |
<pre> |
<pre> |
| 395 |
\p{L} |
\p{L} |
| 396 |
\pL |
\pL |
| 397 |
</pre> |
</pre> |
| 398 |
The following property codes are supported: |
The following general category property codes are supported: |
| 399 |
<pre> |
<pre> |
| 400 |
C Other |
C Other |
| 401 |
Cc Control |
Cc Control |
| 441 |
Zp Paragraph separator |
Zp Paragraph separator |
| 442 |
Zs Space separator |
Zs Space separator |
| 443 |
</pre> |
</pre> |
| 444 |
Extended properties such as "Greek" or "InMusicalSymbols" are not supported by |
The special property L& is also supported: it matches a character that has |
| 445 |
PCRE. |
the Lu, Ll, or Lt property, in other words, a letter that is not classified as |
| 446 |
|
a modifier or "other". |
| 447 |
|
</P> |
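The relationship between one-letter codes, two-letter codes, and L& can be illustrated with Python's standard unicodedata module, which reports the same two-letter general category abbreviations. The function has_category is hypothetical, and Python's Unicode tables may be a different version from those built into PCRE, so this is a sketch of the matching rule rather than a statement of what PCRE returns for any particular character.

```python
import unicodedata

def has_category(ch: str, prop: str) -> bool:
    # Hypothetical helper: True if the single character `ch` has the
    # general category property `prop` as described in the text.
    category = unicodedata.category(ch)  # two-letter code, e.g. "Lu"
    if prop == "L&":
        # L& covers Lu, Ll, and Lt only.
        return category in ("Lu", "Ll", "Lt")
    # A one-letter code includes every category starting with that letter;
    # for a two-letter code this is an exact comparison.
    return category.startswith(prop)
```

For example, has_category("a", "L") is true while has_category("a", "Lu") is false, and a modifier letter such as U+02B0 satisfies "L" but not "L&".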
| 448 |
|
<P> |
| 449 |
|
The long synonyms for these properties that Perl supports (such as \p{Letter}) |
| 450 |
|
are not supported by PCRE. Nor is it permitted to prefix any of these |
| 451 |
|
properties with "Is". |
| 452 |
|
</P> |
| 453 |
|
<P> |
| 454 |
|
No character that is in the Unicode table has the Cn (unassigned) property. |
| 455 |
|
Instead, this property is assumed for any code point that is not in the |
| 456 |
|
Unicode table. |
| 457 |
</P> |
</P> |
| 458 |
<P> |
<P> |
| 459 |
Specifying caseless matching does not affect these escape sequences. For |
Specifying caseless matching does not affect these escape sequences. For |
| 1452 |
(?R) is a recursive call of the entire regular expression. |
(?R) is a recursive call of the entire regular expression. |
| 1453 |
</P> |
</P> |
| 1454 |
<P> |
<P> |
| 1455 |
For example, this PCRE pattern solves the nested parentheses problem (assume |
A recursive subpattern call is always treated as an atomic group. That is, once |
| 1456 |
the PCRE_EXTENDED option is set so that white space is ignored): |
it has matched some of the subject string, it is never re-entered, even if |
| 1457 |
|
it contains untried alternatives and there is a subsequent matching failure. |
| 1458 |
|
</P> |
| 1459 |
|
<P> |
| 1460 |
|
This PCRE pattern solves the nested parentheses problem (assume the |
| 1461 |
|
PCRE_EXTENDED option is set so that white space is ignored): |
| 1462 |
<pre> |
<pre> |
| 1463 |
\( ( (?>[^()]+) | (?R) )* \) |
\( ( (?>[^()]+) | (?R) )* \) |
| 1464 |
</pre> |
</pre> |
| 1465 |
First it matches an opening parenthesis. Then it matches any number of |
First it matches an opening parenthesis. Then it matches any number of |
| 1466 |
substrings which can either be a sequence of non-parentheses, or a recursive |
substrings which can either be a sequence of non-parentheses, or a recursive |
| 1467 |
match of the pattern itself (that is a correctly parenthesized substring). |
match of the pattern itself (that is, a correctly parenthesized substring). |
| 1468 |
Finally there is a closing parenthesis. |
Finally there is a closing parenthesis. |
| 1469 |
</P> |
</P> |
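The three steps described above (opening parenthesis, any number of non-parenthesis runs or recursive matches, closing parenthesis) can be mirrored by a hand-written recursive function. The Python sketch below is an illustration of the same logic, not a use of PCRE; the name match_nested and the convention of returning -1 on failure are assumptions.

```python
def match_nested(subject: str, pos: int = 0) -> int:
    # Hypothetical helper: returns the index just past a correctly
    # parenthesized substring starting at `pos`, or -1 if there is none.
    if pos >= len(subject) or subject[pos] != "(":
        return -1                      # must start with an opening parenthesis
    i = pos + 1
    while i < len(subject):
        if subject[i] == "(":
            end = match_nested(subject, i)   # plays the role of (?R)
            if end == -1:
                return -1
            i = end
        elif subject[i] == ")":
            return i + 1               # the final closing parenthesis
        else:
            # Plays the role of (?>[^()]+): consume the whole run of
            # non-parentheses and never give any of it back.
            while i < len(subject) and subject[i] not in "()":
                i += 1
    return -1                          # ran out of input before closing
```

Called on "(a(b)c)" the function returns 7, and on the unbalanced "(ab(cd)ef" it returns -1 after a single left-to-right pass, because no consumed run is ever retried.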
| 1470 |
<P> |
<P> |
| 1547 |
strings. Such references must, however, follow the subpattern to which they |
strings. Such references must, however, follow the subpattern to which they |
| 1548 |
refer. |
refer. |
| 1549 |
</P> |
</P> |
| 1550 |
|
<P> |
| 1551 |
|
Like recursive subpatterns, a "subroutine" call is always treated as an atomic |
| 1552 |
|
group. That is, once it has matched some of the subject string, it is never |
| 1553 |
|
re-entered, even if it contains untried alternatives and there is a subsequent |
| 1554 |
|
matching failure. |
| 1555 |
|
</P> |
| 1556 |
<br><a name="SEC20" href="#TOC1">CALLOUTS</a><br> |
<br><a name="SEC20" href="#TOC1">CALLOUTS</a><br> |
| 1557 |
<P> |
<P> |
| 1558 |
Perl has a feature whereby using the sequence (?{...}) causes arbitrary Perl |
Perl has a feature whereby using the sequence (?{...}) causes arbitrary Perl |
| 1589 |
documentation. |
documentation. |
| 1590 |
</P> |
</P> |
| 1591 |
<P> |
<P> |
| 1592 |
Last updated: 28 February 2005 |
Last updated: 24 January 2006 |
| 1593 |
<br> |
<br> |
| 1594 |
Copyright © 1997-2005 University of Cambridge. |
Copyright © 1997-2006 University of Cambridge. |
| 1595 |
<p> |
<p> |
| 1596 |
Return to the <a href="index.html">PCRE index page</a>. |
Return to the <a href="index.html">PCRE index page</a>. |
| 1597 |
</p> |
</p> |