| 30 |
for requesting some minor changes that give better JavaScript compatibility. |
for requesting some minor changes that give better JavaScript compatibility. |
| 31 |
</P> |
</P> |
| 32 |
<P> |
<P> |
| 33 |
The current implementation of PCRE corresponds approximately with Perl |
The current implementation of PCRE corresponds approximately with Perl 5.12, |
| 34 |
5.10/5.11, including support for UTF-8 encoded strings and Unicode general |
including support for UTF-8 encoded strings and Unicode general category |
| 35 |
category properties. However, UTF-8 and Unicode support has to be explicitly |
properties. However, UTF-8 and Unicode support has to be explicitly enabled; it |
| 36 |
enabled; it is not the default. The Unicode tables correspond to Unicode |
is not the default. The Unicode tables correspond to Unicode release 6.0.0. |
|
release 5.2.0. |
|
| 37 |
</P> |
</P> |
| 38 |
<P> |
<P> |
| 39 |
In addition to the Perl-compatible matching function, PCRE contains an |
In addition to the Perl-compatible matching function, PCRE contains an |
| 207 |
UTF-8.) |
UTF-8.) |
| 208 |
</P> |
</P> |
| 209 |
<P> |
<P> |
| 210 |
If an invalid UTF-8 string is passed to PCRE, an error return |
If an invalid UTF-8 string is passed to PCRE, an error return is given. At |
| 211 |
(PCRE_ERROR_BADUTF8) is given. In some situations, you may already know that |
compile time, the only additional information is the offset to the first byte |
| 212 |
your strings are valid, and therefore want to skip these checks in order to |
of the failing character. The runtime functions (<b>pcre_exec()</b> and |
| 213 |
improve performance. If you set the PCRE_NO_UTF8_CHECK flag at compile time or |
<b>pcre_dfa_exec()</b>) pass back this information as well as a more detailed |
| 214 |
at run time, PCRE assumes that the pattern or subject it is given |
reason code if the caller has provided memory in which to do this. |
| 215 |
(respectively) contains only valid UTF-8 codes. In this case, it does not |
</P> |
| 216 |
diagnose an invalid UTF-8 string. |
<P> |
| 217 |
|
In some situations, you may already know that your strings are valid, and |
| 218 |
|
therefore want to skip these checks in order to improve performance. If you set |
| 219 |
|
the PCRE_NO_UTF8_CHECK flag at compile time or at run time, PCRE assumes that |
| 220 |
|
the pattern or subject it is given (respectively) contains only valid UTF-8 |
| 221 |
|
codes. In this case, it does not diagnose an invalid UTF-8 string. |
| 222 |
</P> |
</P> |
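To make the idea of "the offset to the first byte of the failing character" concrete, here is a minimal sketch of a standalone UTF-8 validity check. It is not PCRE's own code: the function name first_invalid_utf8_offset is hypothetical, and the rules applied (at most four bytes per character, no overlong forms, no surrogates) are assumed from RFC 3629, so they may differ in detail from the checks PCRE performs.

```c
#include <stddef.h>

/* Hypothetical helper, NOT part of the PCRE API.
   Returns -1 if the buffer is valid UTF-8 under the RFC 3629 rules assumed
   here, otherwise the offset of the first byte of the first malformed
   character. */
static long first_invalid_utf8_offset(const unsigned char *s, size_t len)
{
    size_t i = 0;

    while (i < len) {
        unsigned char c = s[i];
        size_t need;        /* number of continuation bytes expected */
        unsigned long cp;   /* code point being decoded */
        unsigned long min;  /* smallest code point allowed at this length */

        if (c < 0x80) {     /* single-byte (ASCII) character */
            i++;
            continue;
        } else if ((c & 0xE0) == 0xC0) {
            need = 1; cp = c & 0x1F; min = 0x80;
        } else if ((c & 0xF0) == 0xE0) {
            need = 2; cp = c & 0x0F; min = 0x800;
        } else if ((c & 0xF8) == 0xF0) {
            need = 3; cp = c & 0x07; min = 0x10000;
        } else {
            return (long)i; /* stray continuation byte or invalid lead byte */
        }

        if (len - i <= need)
            return (long)i; /* character is truncated at the end of input */

        for (size_t k = 1; k <= need; k++) {
            if ((s[i + k] & 0xC0) != 0x80)
                return (long)i; /* expected a continuation byte (10xxxxxx) */
            cp = (cp << 6) | (unsigned long)(s[i + k] & 0x3F);
        }

        /* Reject overlong encodings, surrogates, and values above U+10FFFF. */
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return (long)i;

        i += need + 1;
    }
    return -1;
}
```

A caller that has already validated a subject with a check of this kind is in the situation described above, where setting PCRE_NO_UTF8_CHECK lets PCRE skip its own validation of that subject.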
| 223 |
<P> |
<P> |
| 224 |
If you pass an invalid UTF-8 string when PCRE_NO_UTF8_CHECK is set, what |
If you pass an invalid UTF-8 string when PCRE_NO_UTF8_CHECK is set, what |
| 264 |
recognizes as digits, spaces, or word characters remain the same set as before, |
recognizes as digits, spaces, or word characters remain the same set as before, |
| 265 |
all with values less than 256. This remains true even when PCRE is built to |
all with values less than 256. This remains true even when PCRE is built to |
| 266 |
include Unicode property support, because to do otherwise would slow down PCRE |
include Unicode property support, because to do otherwise would slow down PCRE |
| 267 |
in many common cases. Note that this also applies to \b, because it is defined |
in many common cases. Note in particular that this applies to \b and \B, |
| 268 |
in terms of \w and \W. If you really want to test for a wider sense of, say, |
because they are defined in terms of \w and \W. If you really want to test |
| 269 |
"digit", you can use explicit Unicode property tests such as \p{Nd}. |
for a wider sense of, say, "digit", you can use explicit Unicode property tests |
| 270 |
Alternatively, if you set the PCRE_UCP option, the way that the character |
such as \p{Nd}. Alternatively, if you set the PCRE_UCP option, the way that |
| 271 |
escapes work is changed so that Unicode properties are used to determine which |
the character escapes work is changed so that Unicode properties are used to |
| 272 |
characters match. There are more details in the section on |
determine which characters match. There are more details in the section on |
| 273 |
<a href="pcrepattern.html#genericchartypes">generic character types</a> |
<a href="pcrepattern.html#genericchartypes">generic character types</a> |
| 274 |
in the |
in the |
| 275 |
<a href="pcrepattern.html"><b>pcrepattern</b></a> |
<a href="pcrepattern.html"><b>pcrepattern</b></a> |
| 280 |
low-valued characters, unless the PCRE_UCP option is set. |
low-valued characters, unless the PCRE_UCP option is set. |
| 281 |
</P> |
</P> |
| 282 |
<P> |
<P> |
| 283 |
8. However, the Perl 5.10 horizontal and vertical whitespace matching escapes |
8. However, the horizontal and vertical whitespace matching escapes (\h, \H, |
| 284 |
(\h, \H, \v, and \V) do match all the appropriate Unicode characters, |
\v, and \V) do match all the appropriate Unicode characters, whether or not |
| 285 |
whether or not PCRE_UCP is set. |
PCRE_UCP is set. |
| 286 |
</P> |
</P> |
| 287 |
<P> |
<P> |
| 288 |
9. Case-insensitive matching applies only to characters whose values are less |
9. Case-insensitive matching applies only to characters whose values are less |
| 290 |
property support is available, PCRE still uses its own character tables when |
property support is available, PCRE still uses its own character tables when |
| 291 |
checking the case of low-valued characters, so as not to degrade performance. |
checking the case of low-valued characters, so as not to degrade performance. |
| 292 |
The Unicode property information is used only for characters with higher |
The Unicode property information is used only for characters with higher |
| 293 |
values. Even when Unicode property support is available, PCRE supports |
values. Furthermore, PCRE supports case-insensitive matching only when there is |
| 294 |
case-insensitive matching only when there is a one-to-one mapping between a |
a one-to-one mapping between a letter's cases. There are a small number of |
| 295 |
letter's cases. There are a small number of many-to-one mappings in Unicode; |
many-to-one mappings in Unicode; these are not supported by PCRE. |
|
these are not supported by PCRE. |
|
| 296 |
</P> |
</P> |
| 297 |
<br><a name="SEC5" href="#TOC1">AUTHOR</a><br> |
<br><a name="SEC5" href="#TOC1">AUTHOR</a><br> |
| 298 |
<P> |
<P> |
| 310 |
</P> |
</P> |
| 311 |
<br><a name="SEC6" href="#TOC1">REVISION</a><br> |
<br><a name="SEC6" href="#TOC1">REVISION</a><br> |
| 312 |
<P> |
<P> |
| 313 |
Last updated: 12 May 2010 |
Last updated: 07 May 2011 |
| 314 |
<br> |
<br> |
| 315 |
Copyright © 1997-2010 University of Cambridge. |
Copyright © 1997-2011 University of Cambridge. |
| 316 |
<br> |
<br> |
| 317 |
<p> |
<p> |
| 318 |
Return to the <a href="index.html">PCRE index page</a>. |
Return to the <a href="index.html">PCRE index page</a>. |