| 156 |
support), the escape sequences \ep{..}, \eP{..}, and \eX are supported. |
support), the escape sequences \ep{..}, \eP{..}, and \eX are supported. |
| 157 |
The available properties that can be tested are limited to the general |
The available properties that can be tested are limited to the general |
| 158 |
category properties such as Lu for an upper case letter or Nd for a decimal |
category properties such as Lu for an upper case letter or Nd for a decimal |
| 159 |
number. A full list is given in the |
number, the Unicode script names such as Arabic or Han, and the derived |
| 160 |
|
properties Any and L&. A full list is given in the |
| 161 |
.\" HREF |
.\" HREF |
| 162 |
\fBpcrepattern\fP |
\fBpcrepattern\fP |
| 163 |
.\" |
.\" |
| 164 |
documentation. The PCRE library is increased in size by about 90K when Unicode |
documentation. Only the short names for properties are supported. For example, |
| 165 |
property support is included. |
\ep{L} matches a letter. Its Perl synonym, \ep{Letter}, is not supported. |
| 166 |
|
Furthermore, in Perl, many properties may optionally be prefixed by "Is", for |
| 167 |
|
compatibility with Perl 5.6. PCRE does not support this. |
| 168 |
.P |
.P |
| 169 |
The following comments apply when PCRE is running in UTF-8 mode: |
The following comments apply when PCRE is running in UTF-8 mode: |
| 170 |
.P |
.P |
| 179 |
PCRE when PCRE_NO_UTF8_CHECK is set, the results are undefined. Your program |
PCRE when PCRE_NO_UTF8_CHECK is set, the results are undefined. Your program |
| 180 |
may crash. |
may crash. |
| 181 |
.P |
.P |
| 182 |
2. In a pattern, the escape sequence \ex{...}, where the contents of the braces |
2. An unbraced hexadecimal escape sequence (such as \exb3) matches a two-byte |
| 183 |
is a string of hexadecimal digits, is interpreted as a UTF-8 character whose |
UTF-8 character if the value is greater than 127. |
|
code number is the given hexadecimal number, for example: \ex{1234}. If a |
|
|
non-hexadecimal digit appears between the braces, the item is not recognized. |
|
|
This escape sequence can be used either as a literal, or within a character |
|
|
class. |
|
| 184 |
.P |
.P |
| 185 |
3. The original hexadecimal escape sequence, \exhh, matches a two-byte UTF-8 |
3. Repeat quantifiers apply to complete UTF-8 characters, not to individual |
|
character if the value is greater than 127. |
|
|
.P |
|
|
4. Repeat quantifiers apply to complete UTF-8 characters, not to individual |
|
| 186 |
bytes, for example: \ex{100}{3}. |
bytes, for example: \ex{100}{3}. |
| 187 |
.P |
.P |
| 188 |
5. The dot metacharacter matches one UTF-8 character instead of a single byte. |
4. The dot metacharacter matches one UTF-8 character instead of a single byte. |
| 189 |
.P |
.P |
| 190 |
6. The escape sequence \eC can be used to match a single byte in UTF-8 mode, |
5. The escape sequence \eC can be used to match a single byte in UTF-8 mode, |
| 191 |
but its use can lead to some strange effects. This facility is not available in |
but its use can lead to some strange effects. This facility is not available in |
| 192 |
the alternative matching function, \fBpcre_dfa_exec()\fP. |
the alternative matching function, \fBpcre_dfa_exec()\fP. |
| 193 |
.P |
.P |
| 194 |
7. The character escapes \eb, \eB, \ed, \eD, \es, \eS, \ew, and \eW correctly |
6. The character escapes \eb, \eB, \ed, \eD, \es, \eS, \ew, and \eW correctly |
| 195 |
test characters of any code value, but the characters that PCRE recognizes as |
test characters of any code value, but the characters that PCRE recognizes as |
| 196 |
digits, spaces, or word characters remain the same set as before, all with |
digits, spaces, or word characters remain the same set as before, all with |
| 197 |
values less than 256. This remains true even when PCRE includes Unicode |
values less than 256. This remains true even when PCRE includes Unicode |
| 199 |
cases. If you really want to test for a wider sense of, say, "digit", you |
cases. If you really want to test for a wider sense of, say, "digit", you |
| 200 |
must use Unicode property tests such as \ep{Nd}. |
must use Unicode property tests such as \ep{Nd}. |
| 201 |
.P |
.P |
| 202 |
8. Similarly, characters that match the POSIX named character classes are all |
7. Similarly, characters that match the POSIX named character classes are all |
| 203 |
low-valued characters. |
low-valued characters. |
| 204 |
.P |
.P |
| 205 |
9. Case-insensitive matching applies only to characters whose values are less |
8. Case-insensitive matching applies only to characters whose values are less |
| 206 |
than 128, unless PCRE is built with Unicode property support. Even when Unicode |
than 128, unless PCRE is built with Unicode property support. Even when Unicode |
| 207 |
property support is available, PCRE still uses its own character tables when |
property support is available, PCRE still uses its own character tables when |
| 208 |
checking the case of low-valued characters, so as not to degrade performance. |
checking the case of low-valued characters, so as not to degrade performance. |
| 209 |
The Unicode property information is used only for characters with higher |
The Unicode property information is used only for characters with higher |
| 210 |
values. |
values. Even when Unicode property support is available, PCRE supports |
| 211 |
|
case-insensitive matching only when there is a one-to-one mapping between a |
| 212 |
|
letter's cases. There are a small number of many-to-one mappings in Unicode; |
| 213 |
|
these are not supported by PCRE. |
| 214 |
. |
. |
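.P
As a rough illustration of the byte-versus-character distinction in the
comments above, the following C sketch encodes a code point as UTF-8 and counts
the characters in a UTF-8 string. Both helpers, utf8_encode() and
utf8_char_count(), are hypothetical names invented for this sketch. They are
not part of the PCRE API, the sketch does not call PCRE at all, and the
counting helper assumes well-formed UTF-8 input.
.sp
.nf
```c
#include <stddef.h>

/* Hypothetical helper, not part of PCRE: encode code point cp as UTF-8
   into out (room for at least 4 bytes) and return the number of bytes
   written, or 0 if cp is a surrogate or lies above 0x10FFFF. */
static size_t utf8_encode(unsigned long cp, unsigned char *out)
{
    if (cp < 0x80) {                      /* 1 byte: 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                     /* 2 bytes: 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp >= 0xD800 && cp <= 0xDFFF)     /* surrogates are not characters */
        return 0;
    if (cp < 0x10000) {                   /* 3 bytes */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                 /* 4 bytes */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}

/* Hypothetical helper, not part of PCRE: count the characters in a
   well-formed UTF-8 string of len bytes. Every byte that is not a
   continuation byte (10xxxxxx) starts a new character. */
static size_t utf8_char_count(const unsigned char *s, size_t len)
{
    size_t count = 0;
    size_t i;

    for (i = 0; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)
            count++;
    }
    return count;
}
```
.fi
.P
Under these assumptions, utf8_encode(0xb3, buf) stores the two bytes 0xc2 0xb3,
which is the sense in which \exb3 is a two-byte character in UTF-8 mode.
Likewise, three copies of code point 0x100 (a subject that \ex{100}{3} would
match) occupy six bytes but count as three characters.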
| 215 |
.SH AUTHOR |
.SH AUTHOR |
| 216 |
.rs |
.rs |
| 226 |
by a dot, at the domain ucs.cam.ac.uk. |
by a dot, at the domain ucs.cam.ac.uk. |
| 227 |
.sp |
.sp |
| 228 |
.in 0 |
.in 0 |
| 229 |
Last updated: 07 March 2005 |
Last updated: 24 January 2006 |
| 230 |
.br |
.br |
| 231 |
Copyright (c) 1997-2005 University of Cambridge. |
Copyright (c) 1997-2006 University of Cambridge. |