| 148 |
\et tab (hex 09) |
\et tab (hex 09) |
| 149 |
\eddd character with octal code ddd, or backreference |
\eddd character with octal code ddd, or backreference |
| 150 |
\exhh character with hex code hh |
\exhh character with hex code hh |
| 151 |
\ex{hhh..} character with hex code hhh... (UTF-8 mode only) |
\ex{hhh..} character with hex code hhh.. |
| 152 |
.sp |
.sp |
| 153 |
The precise effect of \ecx is as follows: if x is a lower case letter, it |
The precise effect of \ecx is as follows: if x is a lower case letter, it |
| 154 |
is converted to upper case. Then bit 6 of the character (hex 40) is inverted. |
is converted to upper case. Then bit 6 of the character (hex 40) is inverted. |
| 156 |
7B. |
7B. |
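The \ecx rule described above can be sketched in a few lines of Python. This is only an illustration of the stated rule (upper-case a lower case letter, then invert hex 40), not PCRE's own code; the function name `control_escape` is hypothetical.

```python
def control_escape(x):
    """Return the character code produced by the control escape for x.

    Follows the rule in the text: a lower case letter is first converted
    to upper case, then bit 6 (hex 40) of the code is inverted.
    """
    code = ord(x)
    if "a" <= x <= "z":
        code = ord(x.upper())
    return code ^ 0x40


if __name__ == "__main__":
    for ch in "zZ{;":
        print(repr(ch), hex(control_escape(ch)))
```

Because the bit is inverted rather than cleared, a character that already has bit 6 clear (such as ";") gains it, which is why the result is not always a control character.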
| 157 |
.P |
.P |
| 158 |
After \ex, from zero to two hexadecimal digits are read (letters can be in |
After \ex, from zero to two hexadecimal digits are read (letters can be in |
| 159 |
upper or lower case). In UTF-8 mode, any number of hexadecimal digits may |
upper or lower case). Any number of hexadecimal digits may appear between \ex{ |
| 160 |
appear between \ex{ and }, but the value of the character code must be less |
and }, but the value of the character code must be less than 256 in non-UTF-8 |
| 161 |
than 2**31 (that is, the maximum hexadecimal value is 7FFFFFFF). If characters |
mode, and less than 2**31 in UTF-8 mode (that is, the maximum hexadecimal value |
| 162 |
other than hexadecimal digits appear between \ex{ and }, or if there is no |
is 7FFFFFFF). If characters other than hexadecimal digits appear between \ex{ |
| 163 |
terminating }, this form of escape is not recognized. Instead, the initial |
and }, or if there is no terminating }, this form of escape is not recognized. |
| 164 |
\ex will be interpreted as a basic hexadecimal escape, with no following |
Instead, the initial \ex will be interpreted as a basic hexadecimal escape, |
| 165 |
digits, giving a character whose value is zero. |
with no following digits, giving a character whose value is zero. |
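The following Python sketch models the \ex parsing rules from the revised paragraph above. It is an assumption-laden illustration, not PCRE's implementation: the name `parse_hex_escape` is hypothetical, an out-of-range value is reported with `ValueError`, and empty braces are treated as unrecognized, a case the text does not specify.

```python
HEX_DIGITS = set("0123456789abcdefABCDEF")


def parse_hex_escape(pattern, pos, utf8=False):
    """Parse the text that follows the hex escape introducer.

    pos is the index of the first character after the introducer.
    Returns (character_code, index_of_next_unread_character).
    """
    # Braced form: any number of hex digits between "{" and "}".
    if pattern.startswith("{", pos):
        end = pattern.find("}", pos + 1)
        if end != -1:
            digits = pattern[pos + 1:end]
            # Assumption: empty braces are not treated as the braced form.
            if digits and all(ch in HEX_DIGITS for ch in digits):
                value = int(digits, 16)
                limit = 2 ** 31 if utf8 else 256
                if value >= limit:
                    raise ValueError("character value too large")
                return value, end + 1
        # Not a well-formed braced escape: fall through to the basic form.

    # Basic form: read from zero to two hex digits.
    end = pos
    while end < len(pattern) and end < pos + 2 and pattern[end] in HEX_DIGITS:
        end += 1
    digits = pattern[pos:end]
    return (int(digits, 16) if digits else 0), end
```

With a malformed braced sequence the function returns the value zero without consuming the brace, which mirrors the statement that the initial \ex is interpreted as a basic escape with no following digits.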
| 166 |
.P |
.P |
| 167 |
Characters whose value is less than 256 can be defined by either of the two |
Characters whose value is less than 256 can be defined by either of the two |
| 168 |
syntaxes for \ex when PCRE is in UTF-8 mode. There is no difference in the |
syntaxes for \ex. There is no difference in the way they are handled. For |
| 169 |
way they are handled. For example, \exdc is exactly the same as \ex{dc}. |
example, \exdc is exactly the same as \ex{dc}. |
| 170 |
.P |
.P |
| 171 |
After \e0 up to two further octal digits are read. In both cases, if there |
After \e0 up to two further octal digits are read. In both cases, if there |
| 172 |
are fewer than two digits, just those that are present are used. Thus the |
are fewer than two digits, just those that are present are used. Thus the |
| 272 |
.P |
.P |
| 273 |
In UTF-8 mode, characters with values greater than 128 never match \ed, \es, or |
In UTF-8 mode, characters with values greater than 128 never match \ed, \es, or |
| 274 |
\ew, and always match \eD, \eS, and \eW. This is true even when Unicode |
\ew, and always match \eD, \eS, and \eW. This is true even when Unicode |
| 275 |
character property support is available. |
character property support is available. The use of locales with Unicode is |
| 276 |
|
discouraged. |
| 277 |
. |
. |
| 278 |
. |
. |
| 279 |
.\" HTML <a name="uniextseq"></a> |
.\" HTML <a name="uniextseq"></a> |
| 281 |
.rs |
.rs |
| 282 |
.sp |
.sp |
| 283 |
When PCRE is built with Unicode character property support, three additional |
When PCRE is built with Unicode character property support, three additional |
| 284 |
escape sequences to match generic character types are available when UTF-8 mode |
escape sequences to match character properties are available when UTF-8 mode |
| 285 |
is selected. They are: |
is selected. They are: |
| 286 |
.sp |
.sp |
| 287 |
\ep{\fIxx\fP} a character with the \fIxx\fP property |
\ep{\fIxx\fP} a character with the \fIxx\fP property |
| 288 |
\eP{\fIxx\fP} a character without the \fIxx\fP property |
\eP{\fIxx\fP} a character without the \fIxx\fP property |
| 289 |
\eX an extended Unicode sequence |
\eX an extended Unicode sequence |
| 290 |
.sp |
.sp |
| 291 |
The property names represented by \fIxx\fP above are limited to the |
The property names represented by \fIxx\fP above are limited to the Unicode |
| 292 |
Unicode general category properties. Each character has exactly one such |
script names, the general category properties, and "Any", which matches any |
| 293 |
property, specified by a two-letter abbreviation. For compatibility with Perl, |
character (including newline). Other properties such as "InMusicalSymbols" are |
| 294 |
negation can be specified by including a circumflex between the opening brace |
not currently supported by PCRE. Note that \eP{Any} does not match any |
| 295 |
and the property name. For example, \ep{^Lu} is the same as \eP{Lu}. |
characters, so always causes a match failure. |
| 296 |
.P |
.P |
| 297 |
If only one letter is specified with \ep or \eP, it includes all the properties |
Sets of Unicode characters are defined as belonging to certain scripts. A |
| 298 |
that start with that letter. In this case, in the absence of negation, the |
character from one of these sets can be matched using a script name. For |
| 299 |
curly brackets in the escape sequence are optional; these two examples have |
example: |
| 300 |
the same effect: |
.sp |
| 301 |
|
\ep{Greek} |
| 302 |
|
\eP{Han} |
| 303 |
|
.sp |
| 304 |
|
Those that are not part of an identified script are lumped together as |
| 305 |
|
"Common". The current list of scripts is: |
| 306 |
|
.P |
| 307 |
|
Arabic, |
| 308 |
|
Armenian, |
| 309 |
|
Bengali, |
| 310 |
|
Bopomofo, |
| 311 |
|
Braille, |
| 312 |
|
Buginese, |
| 313 |
|
Buhid, |
| 314 |
|
Canadian_Aboriginal, |
| 315 |
|
Cherokee, |
| 316 |
|
Common, |
| 317 |
|
Coptic, |
| 318 |
|
Cypriot, |
| 319 |
|
Cyrillic, |
| 320 |
|
Deseret, |
| 321 |
|
Devanagari, |
| 322 |
|
Ethiopic, |
| 323 |
|
Georgian, |
| 324 |
|
Glagolitic, |
| 325 |
|
Gothic, |
| 326 |
|
Greek, |
| 327 |
|
Gujarati, |
| 328 |
|
Gurmukhi, |
| 329 |
|
Han, |
| 330 |
|
Hangul, |
| 331 |
|
Hanunoo, |
| 332 |
|
Hebrew, |
| 333 |
|
Hiragana, |
| 334 |
|
Inherited, |
| 335 |
|
Kannada, |
| 336 |
|
Katakana, |
| 337 |
|
Kharoshthi, |
| 338 |
|
Khmer, |
| 339 |
|
Lao, |
| 340 |
|
Latin, |
| 341 |
|
Limbu, |
| 342 |
|
Linear_B, |
| 343 |
|
Malayalam, |
| 344 |
|
Mongolian, |
| 345 |
|
Myanmar, |
| 346 |
|
New_Tai_Lue, |
| 347 |
|
Ogham, |
| 348 |
|
Old_Italic, |
| 349 |
|
Old_Persian, |
| 350 |
|
Oriya, |
| 351 |
|
Osmanya, |
| 352 |
|
Runic, |
| 353 |
|
Shavian, |
| 354 |
|
Sinhala, |
| 355 |
|
Syloti_Nagri, |
| 356 |
|
Syriac, |
| 357 |
|
Tagalog, |
| 358 |
|
Tagbanwa, |
| 359 |
|
Tai_Le, |
| 360 |
|
Tamil, |
| 361 |
|
Telugu, |
| 362 |
|
Thaana, |
| 363 |
|
Thai, |
| 364 |
|
Tibetan, |
| 365 |
|
Tifinagh, |
| 366 |
|
Ugaritic, |
| 367 |
|
Yi. |
| 368 |
|
.P |
| 369 |
|
Each character has exactly one general category property, specified by a |
| 370 |
|
two-letter abbreviation. For compatibility with Perl, negation can be specified |
| 371 |
|
by including a circumflex between the opening brace and the property name. For |
| 372 |
|
example, \ep{^Lu} is the same as \eP{Lu}. |
| 373 |
|
.P |
| 374 |
|
If only one letter is specified with \ep or \eP, it includes all the general |
| 375 |
|
category properties that start with that letter. In this case, in the absence |
| 376 |
|
of negation, the curly brackets in the escape sequence are optional; these two |
| 377 |
|
examples have the same effect: |
| 378 |
.sp |
.sp |
| 379 |
\ep{L} |
\ep{L} |
| 380 |
\epL |
\epL |
| 381 |
.sp |
.sp |
| 382 |
The following property codes are supported: |
The following general category property codes are supported: |
| 383 |
.sp |
.sp |
| 384 |
C Other |
C Other |
| 385 |
Cc Control |
Cc Control |
| 425 |
Zp Paragraph separator |
Zp Paragraph separator |
| 426 |
Zs Space separator |
Zs Space separator |
| 427 |
.sp |
.sp |
| 428 |
Extended properties such as "Greek" or "InMusicalSymbols" are not supported by |
The special property L& is also supported: it matches a character that has |
| 429 |
PCRE. |
the Lu, Ll, or Lt property, in other words, a letter that is not classified as |
| 430 |
|
a modifier or "other". |
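As a rough model of how the general category codes behave, the sketch below uses Python's standard `unicodedata` module, whose `category()` function returns the two-letter general category of a character. The helper name `has_property` is hypothetical, script names are deliberately not handled, and the Unicode tables shipped with Python may differ from those built into PCRE.

```python
import unicodedata


def has_property(ch, prop):
    """Return True if ch has the general category property prop.

    Handles two-letter codes (such as "Lu"), one-letter codes (such as "L",
    which covers every category starting with that letter), and the
    special "L&" code (Lu, Ll, or Lt).
    """
    category = unicodedata.category(ch)
    if prop == "L&":
        return category in ("Lu", "Ll", "Lt")
    if len(prop) == 1:
        return category.startswith(prop)
    return category == prop


if __name__ == "__main__":
    # U+02B0 is a modifier letter (Lm): a letter, but excluded from L&.
    print(has_property("\u02b0", "L"), has_property("\u02b0", "L&"))
```

The negated forms correspond to taking `not has_property(ch, prop)`.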
| 431 |
|
.P |
| 432 |
|
The long synonyms for these properties that Perl supports (such as \ep{Letter}) |
| 433 |
|
are not supported by PCRE. Nor is it permitted to prefix any of these |
| 434 |
|
properties with "Is". |
| 435 |
|
.P |
| 436 |
|
No character that is in the Unicode table has the Cn (unassigned) property. |
| 437 |
|
Instead, this property is assumed for any code point that is not in the |
| 438 |
|
Unicode table. |
| 439 |
.P |
.P |
| 440 |
Specifying caseless matching does not affect these escape sequences. For |
Specifying caseless matching does not affect these escape sequences. For |
| 441 |
example, \ep{Lu} always matches only upper case letters. |
example, \ep{Lu} always matches only upper case letters. |
| 1433 |
"subroutine" call, which is described in the next section.) The special item |
"subroutine" call, which is described in the next section.) The special item |
| 1434 |
(?R) is a recursive call of the entire regular expression. |
(?R) is a recursive call of the entire regular expression. |
| 1435 |
.P |
.P |
| 1436 |
For example, this PCRE pattern solves the nested parentheses problem (assume |
A recursive subpattern call is always treated as an atomic group. That is, once |
| 1437 |
the PCRE_EXTENDED option is set so that white space is ignored): |
it has matched some of the subject string, it is never re-entered, even if |
| 1438 |
|
it contains untried alternatives and there is a subsequent matching failure. |
| 1439 |
|
.P |
| 1440 |
|
This PCRE pattern solves the nested parentheses problem (assume the |
| 1441 |
|
PCRE_EXTENDED option is set so that white space is ignored): |
| 1442 |
.sp |
.sp |
| 1443 |
\e( ( (?>[^()]+) | (?R) )* \e) |
\e( ( (?>[^()]+) | (?R) )* \e) |
| 1444 |
.sp |
.sp |
| 1445 |
First it matches an opening parenthesis. Then it matches any number of |
First it matches an opening parenthesis. Then it matches any number of |
| 1446 |
substrings which can either be a sequence of non-parentheses, or a recursive |
substrings which can either be a sequence of non-parentheses, or a recursive |
| 1447 |
match of the pattern itself (that is a correctly parenthesized substring). |
match of the pattern itself (that is, a correctly parenthesized substring). |
| 1448 |
Finally there is a closing parenthesis. |
Finally there is a closing parenthesis. |
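Python's standard `re` module has no (?R) item, so the sketch below does not run the pattern itself. Instead it hand-codes the same three steps (opening parenthesis, then any number of non-parenthesis runs or recursive matches, then closing parenthesis) as a recursive function. The name `match_parens` is hypothetical, and this is a model of the pattern's logic rather than of PCRE's matching engine.

```python
def match_parens(subject, pos=0):
    """Match one correctly parenthesized substring starting at pos.

    Returns the index just past the closing parenthesis, or None if there
    is no match at pos.
    """
    # First, an opening parenthesis.
    if pos >= len(subject) or subject[pos] != "(":
        return None
    i = pos + 1
    # Then any number of non-parentheses or recursive matches.
    while i < len(subject):
        if subject[i] == ")":
            # Finally, the closing parenthesis.
            return i + 1
        if subject[i] == "(":
            end = match_parens(subject, i)
            if end is None:
                return None
            i = end
        else:
            i += 1
    return None
```

For example, `match_parens("(a(b)c)")` consumes the whole string, while `match_parens("(a(b)c")` returns `None` because the outer parenthesis is never closed.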
| 1449 |
.P |
.P |
| 1450 |
If this were part of a larger pattern, you would not want to recurse the entire |
If this were part of a larger pattern, you would not want to recurse the entire |
| 1528 |
is used, it does match "sense and responsibility" as well as the other two |
is used, it does match "sense and responsibility" as well as the other two |
| 1529 |
strings. Such references must, however, follow the subpattern to which they |
strings. Such references must, however, follow the subpattern to which they |
| 1530 |
refer. |
refer. |
| 1531 |
|
.P |
| 1532 |
|
Like recursive subpatterns, a "subroutine" call is always treated as an atomic |
| 1533 |
|
group. That is, once it has matched some of the subject string, it is never |
| 1534 |
|
re-entered, even if it contains untried alternatives and there is a subsequent |
| 1535 |
|
matching failure. |
| 1536 |
. |
. |
| 1537 |
. |
. |
| 1538 |
.SH CALLOUTS |
.SH CALLOUTS |
| 1571 |
documentation. |
documentation. |
| 1572 |
.P |
.P |
| 1573 |
.in 0 |
.in 0 |
| 1574 |
Last updated: 28 February 2005 |
Last updated: 24 January 2006 |
| 1575 |
.br |
.br |
| 1576 |
Copyright (c) 1997-2005 University of Cambridge. |
Copyright (c) 1997-2006 University of Cambridge. |