</b><br>
<P>
From Release 8.30, in addition to its previous UTF-8 support, PCRE also
supports UTF-16 by means of a separate 16-bit library. This can be built as
well as, or instead of, the 8-bit library.
</P>
<br><b>
</P>
<P>
The excluded code points are the "Surrogate Area" of Unicode. They are reserved
for use by UTF-16, where they are used in pairs to encode code points with
values greater than 0xFFFF. The code points that are encoded by UTF-16 pairs
are available independently in the UTF-8 encoding. (In other words, the whole
surrogate thing is a fudge for UTF-16, which unfortunately messes up UTF-8.)
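<P>
As an illustration of how UTF-16 uses surrogate pairs, the following is a
minimal sketch in C. It is not part of PCRE: the function name utf16_encode
and its return convention are hypothetical, chosen only for this example. It
assumes the standard UTF-16 scheme, in which a code point above 0xFFFF is
reduced by 0x10000 and its upper and lower 10 bits are placed in a high
surrogate (0xD800 to 0xDBFF) and a low surrogate (0xDC00 to 0xDFFF).
</P>

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not a PCRE function.
   Encodes one Unicode code point as UTF-16 data units.
   Returns the number of 16-bit units written to out (1 or 2),
   or 0 if cp is a surrogate value or lies beyond 0x10FFFF. */
static size_t utf16_encode(uint32_t cp, uint16_t out[2])
{
    /* The surrogate area is reserved and is not a valid code point. */
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;

    /* Values above 0x10FFFF cannot be represented in UTF-16. */
    if (cp > 0x10FFFF)
        return 0;

    /* Code points up to 0xFFFF fit in a single data unit. */
    if (cp <= 0xFFFF) {
        out[0] = (uint16_t)cp;
        return 1;
    }

    /* Larger values are split into a high and a low surrogate. */
    cp -= 0x10000;
    out[0] = (uint16_t)(0xD800 | (cp >> 10));   /* upper 10 bits */
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF)); /* lower 10 bits */
    return 2;
}
```

<P>
With this sketch, the code point 0x1F600 is encoded as the pair 0xD83D, 0xDE00,
whereas 0x100 occupies a single data unit.
</P>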
data units, for example: \x{100}{3}.
</P>
<P>
4. The dot metacharacter matches one UTF character instead of a single data
unit.
</P>
<P>
<P>
6. The character escapes \b, \B, \d, \D, \s, \S, \w, and \W correctly
test characters of any code value, but, by default, the characters that PCRE
recognizes as digits, spaces, or word characters remain the same set as in
non-UTF mode, all with values less than 256. This remains true even when PCRE
is built to include Unicode property support, because to do otherwise would
slow down PCRE in many common cases. Note in particular that this applies to