| 156 |
support), the escape sequences \p{..}, \P{..}, and \X are supported. |
support), the escape sequences \p{..}, \P{..}, and \X are supported. |
| 157 |
The available properties that can be tested are limited to the general |
The available properties that can be tested are limited to the general |
| 158 |
category properties such as Lu for an upper case letter or Nd for a decimal |
category properties such as Lu for an upper case letter or Nd for a decimal |
| 159 |
number. A full list is given in the |
number, the Unicode script names such as Arabic or Han, and the derived |
| 160 |
|
properties Any and L&. A full list is given in the |
| 161 |
<a href="pcrepattern.html"><b>pcrepattern</b></a> |
<a href="pcrepattern.html"><b>pcrepattern</b></a> |
| 162 |
documentation. The PCRE library is increased in size by about 90K when Unicode |
documentation. Only the short names for properties are supported. For example, |
| 163 |
property support is included. |
\p{L} matches a letter. Its Perl synonym, \p{Letter}, is not supported. |
| 164 |
|
Furthermore, in Perl, many properties may optionally be prefixed by "Is", for |
| 165 |
|
compatibility with Perl 5.6. PCRE does not support this. |
| 166 |
</P> |
</P> |
| 167 |
<P> |
<P> |
| 168 |
The following comments apply when PCRE is running in UTF-8 mode: |
The following comments apply when PCRE is running in UTF-8 mode: |
| 180 |
may crash. |
may crash. |
| 181 |
</P> |
</P> |
| 182 |
<P> |
<P> |
| 183 |
2. In a pattern, the escape sequence \x{...}, where the contents of the braces |
2. An unbraced hexadecimal escape sequence (such as \xb3) matches a two-byte |
| 184 |
is a string of hexadecimal digits, is interpreted as a UTF-8 character whose |
UTF-8 character if the value is greater than 127. |
|
code number is the given hexadecimal number, for example: \x{1234}. If a |
|
|
non-hexadecimal digit appears between the braces, the item is not recognized. |
|
|
This escape sequence can be used either as a literal, or within a character |
|
|
class. |
|
| 185 |
</P> |
</P> |
| 186 |
<P> |
<P> |
| 187 |
3. The original hexadecimal escape sequence, \xhh, matches a two-byte UTF-8 |
3. Repeat quantifiers apply to complete UTF-8 characters, not to individual |
|
character if the value is greater than 127. |
|
|
</P> |
|
|
<P> |
|
|
4. Repeat quantifiers apply to complete UTF-8 characters, not to individual |
|
| 188 |
bytes, for example: \x{100}{3}. |
bytes, for example: \x{100}{3}. |
| 189 |
</P> |
</P> |
| 190 |
<P> |
<P> |
| 191 |
5. The dot metacharacter matches one UTF-8 character instead of a single byte. |
4. The dot metacharacter matches one UTF-8 character instead of a single byte. |
| 192 |
</P> |
</P> |
| 193 |
<P> |
<P> |
| 194 |
6. The escape sequence \C can be used to match a single byte in UTF-8 mode, |
5. The escape sequence \C can be used to match a single byte in UTF-8 mode, |
| 195 |
but its use can lead to some strange effects. This facility is not available in |
but its use can lead to some strange effects. This facility is not available in |
| 196 |
the alternative matching function, <b>pcre_dfa_exec()</b>. |
the alternative matching function, <b>pcre_dfa_exec()</b>. |
| 197 |
</P> |
</P> |
| 198 |
<P> |
<P> |
| 199 |
7. The character escapes \b, \B, \d, \D, \s, \S, \w, and \W correctly |
6. The character escapes \b, \B, \d, \D, \s, \S, \w, and \W correctly |
| 200 |
test characters of any code value, but the characters that PCRE recognizes as |
test characters of any code value, but the characters that PCRE recognizes as |
| 201 |
digits, spaces, or word characters remain the same set as before, all with |
digits, spaces, or word characters remain the same set as before, all with |
| 202 |
values less than 256. This remains true even when PCRE includes Unicode |
values less than 256. This remains true even when PCRE includes Unicode |
| 205 |
must use Unicode property tests such as \p{Nd}. |
must use Unicode property tests such as \p{Nd}. |
| 206 |
</P> |
</P> |
| 207 |
<P> |
<P> |
| 208 |
8. Similarly, characters that match the POSIX named character classes are all |
7. Similarly, characters that match the POSIX named character classes are all |
| 209 |
low-valued characters. |
low-valued characters. |
| 210 |
</P> |
</P> |
| 211 |
<P> |
<P> |
| 212 |
9. Case-insensitive matching applies only to characters whose values are less |
8. Case-insensitive matching applies only to characters whose values are less |
| 213 |
than 128, unless PCRE is built with Unicode property support. Even when Unicode |
than 128, unless PCRE is built with Unicode property support. Even when Unicode |
| 214 |
property support is available, PCRE still uses its own character tables when |
property support is available, PCRE still uses its own character tables when |
| 215 |
checking the case of low-valued characters, so as not to degrade performance. |
checking the case of low-valued characters, so as not to degrade performance. |
| 216 |
The Unicode property information is used only for characters with higher |
The Unicode property information is used only for characters with higher |
| 217 |
values. |
values. Even when Unicode property support is available, PCRE supports |
| 218 |
|
case-insensitive matching only when there is a one-to-one mapping between a |
| 219 |
|
letter's cases. There are a small number of many-to-one mappings in Unicode; |
| 220 |
|
these are not supported by PCRE. |
| 221 |
</P> |
</P> |
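<P>
The byte-versus-character distinction that runs through the UTF-8 comments
above can be checked with a small sketch. It uses Python's built-in UTF-8
codec purely as an illustration and does not call PCRE; the helper name
<b>utf8_bytes</b> is hypothetical. It shows that a value above 127, such as
the one written \xb3, occupies two bytes in a UTF-8 subject, and that three
repetitions of the character \x{100} are three characters but six bytes.
</P>

```python
# Illustration only: this uses Python's built-in UTF-8 codec, not PCRE.
# The helper name utf8_bytes is hypothetical and not part of any PCRE API.

def utf8_bytes(code_point: int) -> bytes:
    """Return the UTF-8 encoding of a single Unicode code point."""
    return chr(code_point).encode("utf-8")


# A value below 128 stays a single byte in UTF-8.
ascii_a = utf8_bytes(0x41)

# A value above 127, such as the one written \xb3, needs two bytes.
b3 = utf8_bytes(0xB3)

# The character written \x{100} in a pattern is also two bytes long.
c100 = utf8_bytes(0x100)

# A subject that \x{100}{3} is meant to match in full:
# three characters, but six bytes.
subject = chr(0x100) * 3
subject_bytes = subject.encode("utf-8")

print(len(ascii_a), len(b3), len(c100))
print(len(subject), len(subject_bytes))
```

<P>
Counting characters rather than bytes is what the comments on repeat
quantifiers and the dot metacharacter describe; \C is the exception that
works at byte level.
</P>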
| 222 |
<br><a name="SEC5" href="#TOC1">AUTHOR</a><br> |
<br><a name="SEC5" href="#TOC1">AUTHOR</a><br> |
| 223 |
<P> |
<P> |
| 231 |
Putting an actual email address here seems to have been a spam magnet, so I've |
Putting an actual email address here seems to have been a spam magnet, so I've |
| 232 |
taken it away. If you want to email me, use my initial and surname, separated |
taken it away. If you want to email me, use my initial and surname, separated |
| 233 |
by a dot, at the domain ucs.cam.ac.uk. |
by a dot, at the domain ucs.cam.ac.uk. |
| 234 |
Last updated: 07 March 2005 |
Last updated: 24 January 2006 |
| 235 |
<br> |
<br> |
| 236 |
Copyright © 1997-2005 University of Cambridge. |
Copyright © 1997-2006 University of Cambridge. |
| 237 |
<p> |
<p> |
| 238 |
Return to the <a href="index.html">PCRE index page</a>. |
Return to the <a href="index.html">PCRE index page</a>. |
| 239 |
</p> |
</p> |