| 160 |
<br><a name="SEC5" href="#TOC1">PCRE 16-BIT API 16-BIT-ONLY FUNCTION</a><br> |
<br><a name="SEC5" href="#TOC1">PCRE 16-BIT API 16-BIT-ONLY FUNCTION</a><br> |
| 161 |
<P> |
<P> |
| 162 |
<b>int pcre16_utf16_to_host_byte_order(PCRE_UCHAR16 *<i>output</i>,</b> |
<b>int pcre16_utf16_to_host_byte_order(PCRE_UCHAR16 *<i>output</i>,</b> |
| 163 |
<b>PCRE_SPTR16 <i>input</i>, int <i>length</i>, int *<i>byte_order</i>, </b> |
<b>PCRE_SPTR16 <i>input</i>, int <i>length</i>, int *<i>byte_order</i>,</b> |
| 164 |
<b>int <i>keep_boms</i>);</b> |
<b>int <i>keep_boms</i>);</b> |
| 165 |
</P> |
</P> |
| 166 |
<br><a name="SEC6" href="#TOC1">THE PCRE 16-BIT LIBRARY</a><br> |
<br><a name="SEC6" href="#TOC1">THE PCRE 16-BIT LIBRARY</a><br> |
| 177 |
16-bit library. |
16-bit library. |
| 178 |
</P> |
</P> |
| 179 |
<P> |
<P> |
| 180 |
WARNING: A single application can be linked with both libraries, but you must |
WARNING: A single application can be linked with both libraries, but you must |
| 181 |
take care when processing any particular pattern to use functions from just one |
take care when processing any particular pattern to use functions from just one |
| 182 |
library. For example, if you want to study a pattern that was compiled with |
library. For example, if you want to study a pattern that was compiled with |
| 183 |
<b>pcre16_compile()</b>, you must do so with <b>pcre16_study()</b>, not |
<b>pcre16_compile()</b>, you must do so with <b>pcre16_study()</b>, not |
| 184 |
<b>pcre_study()</b>, and you must free the study data with |
<b>pcre_study()</b>, and you must free the study data with |
| 186 |
</P> |
</P> |
| 187 |
<br><a name="SEC7" href="#TOC1">THE HEADER FILE</a><br> |
<br><a name="SEC7" href="#TOC1">THE HEADER FILE</a><br> |
| 188 |
<P> |
<P> |
| 189 |
There is only one header file, <b>pcre.h</b>. It contains prototypes for all the |
There is only one header file, <b>pcre.h</b>. It contains prototypes for all the |
| 190 |
functions in both libraries, as well as definitions of flags, structures, error |
functions in both libraries, as well as definitions of flags, structures, error |
| 191 |
codes, etc. |
codes, etc. |
| 192 |
</P> |
</P> |
| 193 |
<br><a name="SEC8" href="#TOC1">THE LIBRARY NAME</a><br> |
<br><a name="SEC8" href="#TOC1">THE LIBRARY NAME</a><br> |
| 194 |
<P> |
<P> |
| 195 |
In Unix-like systems, the 16-bit library is called <b>libpcre16</b>, and can |
In Unix-like systems, the 16-bit library is called <b>libpcre16</b>, and can |
| 196 |
normally be accessed by adding <b>-lpcre16</b> to the command for linking an |
normally be accessed by adding <b>-lpcre16</b> to the command for linking an |
| 197 |
application that uses PCRE. |
application that uses PCRE. |
| 198 |
</P> |
</P> |
| 199 |
<br><a name="SEC9" href="#TOC1">STRING TYPES</a><br> |
<br><a name="SEC9" href="#TOC1">STRING TYPES</a><br> |
| 200 |
<P> |
<P> |
| 201 |
In the 8-bit library, strings are passed to PCRE library functions as vectors |
In the 8-bit library, strings are passed to PCRE library functions as vectors |
| 202 |
of bytes with the C type "char *". In the 16-bit library, strings are passed as |
of bytes with the C type "char *". In the 16-bit library, strings are passed as |
| 203 |
vectors of unsigned 16-bit quantities. The macro PCRE_UCHAR16 specifies an |
vectors of unsigned 16-bit quantities. The macro PCRE_UCHAR16 specifies an |
| 204 |
appropriate data type, and PCRE_SPTR16 is defined as "const PCRE_UCHAR16 *". In |
appropriate data type, and PCRE_SPTR16 is defined as "const PCRE_UCHAR16 *". In |
| 205 |
very many environments, "short int" is a 16-bit data type. When PCRE is built, |
very many environments, "short int" is a 16-bit data type. When PCRE is built, |
| 206 |
it defines PCRE_UCHAR16 as "short int", but checks that it really is a 16-bit |
it defines PCRE_UCHAR16 as "short int", but checks that it really is a 16-bit |
| 207 |
data type. If it is not, the build fails with an error message telling the |
data type. If it is not, the build fails with an error message telling the |
| 208 |
maintainer to modify the definition appropriately. |
maintainer to modify the definition appropriately. |
| 209 |
</P> |
</P> |
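<P>
The width check described above can be illustrated with a small standalone
sketch. It does not include <b>pcre.h</b>; the type name <i>my_uchar16</i> and
both helper functions are hypothetical stand-ins chosen for this example, not
part of the PCRE API.
</P>

```c
#include <limits.h>
#include <stddef.h>

/* Hypothetical stand-in for PCRE_UCHAR16; not the definition from pcre.h. */
typedef unsigned short my_uchar16;

/* Returns 1 if my_uchar16 is exactly 16 bits wide on this platform, else 0.
   This mirrors the kind of check the text says is made at build time. */
int uchar16_is_16_bits(void)
{
    return sizeof(my_uchar16) * CHAR_BIT == 16;
}

/* Length of a zero-terminated 16-bit string, counted in 16-bit units
   (not bytes). */
size_t ustr16_length(const my_uchar16 *s)
{
    size_t n = 0;
    while (s[n] != 0)
        n++;
    return n;
}
```

<P>
On a platform where "unsigned short" is 16 bits wide, the first function
returns 1, and a string holding three characters followed by a zero has length
3, whatever its size in bytes.
</P>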
| 210 |
<br><a name="SEC10" href="#TOC1">STRUCTURE TYPES</a><br> |
<br><a name="SEC10" href="#TOC1">STRUCTURE TYPES</a><br> |
| 211 |
<P> |
<P> |
| 212 |
The types of the opaque structures that are used for compiled 16-bit patterns |
The types of the opaque structures that are used for compiled 16-bit patterns |
| 213 |
and JIT stacks are <b>pcre16</b> and <b>pcre16_jit_stack</b> respectively. The |
and JIT stacks are <b>pcre16</b> and <b>pcre16_jit_stack</b> respectively. The |
| 214 |
type of the user-accessible structure that is returned by <b>pcre16_study()</b> |
type of the user-accessible structure that is returned by <b>pcre16_study()</b> |
| 215 |
is <b>pcre16_extra</b>, and the type of the structure that is used for passing |
is <b>pcre16_extra</b>, and the type of the structure that is used for passing |
| 216 |
data to a callout function is <b>pcre16_callout_block</b>. These structures |
data to a callout function is <b>pcre16_callout_block</b>. These structures |
| 217 |
contain the same fields, with the same names, as their 8-bit counterparts. The |
contain the same fields, with the same names, as their 8-bit counterparts. The |
| 218 |
only difference is that pointers to character strings are 16-bit instead of |
only difference is that pointers to character strings are 16-bit instead of |
| 219 |
8-bit types. |
8-bit types. |
| 220 |
</P> |
</P> |
| 221 |
<br><a name="SEC11" href="#TOC1">16-BIT FUNCTIONS</a><br> |
<br><a name="SEC11" href="#TOC1">16-BIT FUNCTIONS</a><br> |
| 222 |
<P> |
<P> |
| 223 |
For every function in the 8-bit library there is a corresponding function in |
For every function in the 8-bit library there is a corresponding function in |
| 224 |
the 16-bit library with a name that starts with <b>pcre16_</b> instead of |
the 16-bit library with a name that starts with <b>pcre16_</b> instead of |
| 225 |
<b>pcre_</b>. The prototypes are listed above. In addition, there is one extra |
<b>pcre_</b>. The prototypes are listed above. In addition, there is one extra |
| 226 |
function, <b>pcre16_utf16_to_host_byte_order()</b>. This is a utility function |
function, <b>pcre16_utf16_to_host_byte_order()</b>. This is a utility function |
| 227 |
that converts a UTF-16 character string to host byte order if necessary. The |
that converts a UTF-16 character string to host byte order if necessary. The |
| 228 |
other 16-bit functions expect the strings they are passed to be in host byte |
other 16-bit functions expect the strings they are passed to be in host byte |
| 229 |
order. |
order. |
| 230 |
</P> |
</P> |
| 231 |
<P> |
<P> |
| 232 |
The <i>input</i> and <i>output</i> arguments of |
The <i>input</i> and <i>output</i> arguments of |
| 233 |
<b>pcre16_utf16_to_host_byte_order()</b> may point to the same address, that is, |
<b>pcre16_utf16_to_host_byte_order()</b> may point to the same address, that is, |
| 234 |
conversion in place is supported. The output buffer must be at least as long as |
conversion in place is supported. The output buffer must be at least as long as |
| 235 |
the input. |
the input. |
| 236 |
</P> |
</P> |
| 237 |
<P> |
<P> |
| 239 |
input string; a negative value specifies a zero-terminated string. |
input string; a negative value specifies a zero-terminated string. |
| 240 |
</P> |
</P> |
| 241 |
<P> |
<P> |
| 242 |
If <i>byte_order</i> is NULL, it is assumed that the string starts off in host |
If <i>byte_order</i> is NULL, it is assumed that the string starts off in host |
| 243 |
byte order. This may be changed by byte-order marks (BOMs) anywhere in the |
byte order. This may be changed by byte-order marks (BOMs) anywhere in the |
| 244 |
string (commonly as the first character). |
string (commonly as the first character). |
| 245 |
</P> |
</P> |
| 246 |
<P> |
<P> |
| 247 |
If <i>byte_order</i> is not NULL, a non-zero value of the integer to which it |
If <i>byte_order</i> is not NULL, a non-zero value of the integer to which it |
| 248 |
points means that the input starts off in host byte order, otherwise the |
points means that the input starts off in host byte order, otherwise the |
| 249 |
opposite order is assumed. Again, BOMs in the string can change this. The final |
opposite order is assumed. Again, BOMs in the string can change this. The final |
| 250 |
byte order is passed back at the end of processing. |
byte order is passed back at the end of processing. |
| 251 |
</P> |
</P> |
| 252 |
<P> |
<P> |
| 253 |
If <i>keep_boms</i> is not zero, byte-order mark characters (0xfeff) are copied |
If <i>keep_boms</i> is not zero, byte-order mark characters (0xfeff) are copied |
| 254 |
into the output string. Otherwise they are discarded. |
into the output string. Otherwise they are discarded. |
| 255 |
</P> |
</P> |
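<P>
The following is a minimal sketch of the conversion rules described above,
written without <b>pcre.h</b> so that it stands alone. It is not the library's
implementation. The names <i>u16</i>, <i>swap16</i> and
<i>to_host_order_sketch</i> are hypothetical, and the return value (the number
of units written, not counting a terminating zero) is a choice made for this
example only.
</P>

```c
#include <stddef.h>

typedef unsigned short u16;   /* hypothetical stand-in for PCRE_UCHAR16 */

/* Swap the two bytes of a 16-bit unit. */
static u16 swap16(u16 c)
{
    return (u16)(((c & 0xff) << 8) | ((c >> 8) & 0xff));
}

/* Sketch only: copies input to output in host byte order.
   length < 0 means the input is zero-terminated.
   *byte_order non-zero (or byte_order == NULL) means the input starts
   in host byte order. Input and output may be the same buffer. */
int to_host_order_sketch(u16 *output, const u16 *input, int length,
                         int *byte_order, int keep_boms)
{
    int host_bo = (byte_order != NULL) ? *byte_order : 1;
    int zero_terminated = 0;
    int written = 0;
    int i;

    if (length < 0) {
        zero_terminated = 1;
        length = 0;
        while (input[length] != 0)
            length++;
    }

    for (i = 0; i < length; i++) {
        u16 c = input[i];
        if (c == 0xfeff || c == 0xfffe) {
            /* A BOM seen as 0xfeff means host order from here on;
               seen as 0xfffe it means the opposite order. */
            host_bo = (c == 0xfeff);
            if (keep_boms)
                output[written++] = 0xfeff;
        } else {
            output[written++] = host_bo ? c : swap16(c);
        }
    }

    if (zero_terminated)
        output[written] = 0;      /* assumption: keep the string terminated */
    if (byte_order != NULL)
        *byte_order = host_bo;    /* pass back the final byte order */
    return written;
}
```

<P>
Because each unit is read before the corresponding output position is written,
and the output index never runs ahead of the input index, the sketch also works
when <i>input</i> and <i>output</i> point to the same buffer.
</P>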
| 256 |
<P> |
<P> |
| 259 |
</P> |
</P> |
| 260 |
<br><a name="SEC12" href="#TOC1">SUBJECT STRING OFFSETS</a><br> |
<br><a name="SEC12" href="#TOC1">SUBJECT STRING OFFSETS</a><br> |
| 261 |
<P> |
<P> |
| 262 |
The offsets within subject strings that are returned by the matching functions |
The offsets within subject strings that are returned by the matching functions |
| 263 |
are in 16-bit units rather than bytes. |
are in 16-bit units rather than bytes. |
| 264 |
</P> |
</P> |
| 265 |
<br><a name="SEC13" href="#TOC1">NAMED SUBPATTERNS</a><br> |
<br><a name="SEC13" href="#TOC1">NAMED SUBPATTERNS</a><br> |
| 266 |
<P> |
<P> |
| 267 |
The name-to-number translation table that is maintained for named subpatterns |
The name-to-number translation table that is maintained for named subpatterns |
| 268 |
uses 16-bit characters. The <b>pcre16_get_stringtable_entries()</b> function |
uses 16-bit characters. The <b>pcre16_get_stringtable_entries()</b> function |
| 269 |
returns the length of each entry in the table as the number of 16-bit data |
returns the length of each entry in the table as the number of 16-bit data |
| 270 |
units. |
units. |
| 271 |
</P> |
</P> |
| 272 |
<br><a name="SEC14" href="#TOC1">OPTION NAMES</a><br> |
<br><a name="SEC14" href="#TOC1">OPTION NAMES</a><br> |
| 276 |
fact, these new options define the same bits in the options word. |
fact, these new options define the same bits in the options word. |
| 277 |
</P> |
</P> |
| 278 |
<P> |
<P> |
| 279 |
For the <b>pcre16_config()</b> function there is an option PCRE_CONFIG_UTF16 |
For the <b>pcre16_config()</b> function there is an option PCRE_CONFIG_UTF16 |
| 280 |
that returns 1 if UTF-16 support is configured, otherwise 0. If this option is |
that returns 1 if UTF-16 support is configured, otherwise 0. If this option is |
| 281 |
given to <b>pcre_config()</b>, or if the PCRE_CONFIG_UTF8 option is given to |
given to <b>pcre_config()</b>, or if the PCRE_CONFIG_UTF8 option is given to |
| 282 |
<b>pcre16_config()</b>, the result is the PCRE_ERROR_BADOPTION error. |
<b>pcre16_config()</b>, the result is the PCRE_ERROR_BADOPTION error. |
| 283 |
</P> |
</P> |
| 284 |
<br><a name="SEC15" href="#TOC1">CHARACTER CODES</a><br> |
<br><a name="SEC15" href="#TOC1">CHARACTER CODES</a><br> |
| 285 |
<P> |
<P> |
| 286 |
In 16-bit mode, when PCRE_UTF16 is not set, character values are treated in the |
In 16-bit mode, when PCRE_UTF16 is not set, character values are treated in the |
| 287 |
same way as in 8-bit, non-UTF-8 mode, except, of course, that they can range |
same way as in 8-bit, non-UTF-8 mode, except, of course, that they can range |
| 288 |
from 0 to 0xffff instead of 0 to 0xff. Character types for characters less than |
from 0 to 0xffff instead of 0 to 0xff. Character types for characters less than |
| 289 |
0xff can therefore be influenced by the locale in the same way as before. |
0xff can therefore be influenced by the locale in the same way as before. |
| 290 |
Characters greater than 0xff have only one case, and no "type" (such as letter |
Characters greater than 0xff have only one case, and no "type" (such as letter |
| 291 |
or digit). |
or digit). |
| 292 |
</P> |
</P> |
| 293 |
<P> |
<P> |
| 294 |
In UTF-16 mode, the character code is Unicode, in the range 0 to 0x10ffff, with |
In UTF-16 mode, the character code is Unicode, in the range 0 to 0x10ffff, with |
| 295 |
the exception of values in the range 0xd800 to 0xdfff because those are |
the exception of values in the range 0xd800 to 0xdfff because those are |
| 296 |
"surrogate" values that are used in pairs to encode values greater than 0xffff. |
"surrogate" values that are used in pairs to encode values greater than 0xffff. |
| 297 |
</P> |
</P> |
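<P>
As an illustration of how surrogate pairs encode values greater than 0xffff,
the sketch below decodes one code point from a UTF-16 string in host byte
order. It uses the standard UTF-16 arithmetic and is independent of PCRE; the
names <i>u16</i> and <i>utf16_decode_one</i> are hypothetical.
</P>

```c
typedef unsigned short u16;   /* hypothetical stand-in for PCRE_UCHAR16 */

/* Decodes the code point that starts at s[0], given that avail units are
   readable. On success, stores the number of units consumed (1 or 2) in
   *units and returns the code point. Returns -1, with *units set to 0,
   for an isolated or truncated surrogate. */
long utf16_decode_one(const u16 *s, int avail, int *units)
{
    unsigned long hi, lo;

    *units = 0;
    if (avail < 1)
        return -1;

    hi = s[0];
    if (hi < 0xd800 || hi > 0xdfff) {   /* not a surrogate: one unit */
        *units = 1;
        return (long)hi;
    }
    if (hi > 0xdbff || avail < 2)       /* low surrogate first, or truncated */
        return -1;

    lo = s[1];
    if (lo < 0xdc00 || lo > 0xdfff)     /* high surrogate not followed by low */
        return -1;

    *units = 2;
    return (long)(0x10000 + ((hi - 0xd800) << 10) + (lo - 0xdc00));
}
```

<P>
For example, the pair 0xd83d 0xde00 decodes to 0x1f600 and consumes two 16-bit
units, whereas 0x0041 decodes to itself and consumes one.
</P>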
| 298 |
<P> |
<P> |
| 299 |
A UTF-16 string can indicate its endianness by a special code known as a |
A UTF-16 string can indicate its endianness by a special code known as a |
| 300 |
byte-order mark (BOM). The PCRE functions do not handle this, expecting strings |
byte-order mark (BOM). The PCRE functions do not handle this, expecting strings |
| 301 |
to be in host byte order. A utility function called |
to be in host byte order. A utility function called |
| 302 |
<b>pcre16_utf16_to_host_byte_order()</b> is provided to help with this (see |
<b>pcre16_utf16_to_host_byte_order()</b> is provided to help with this (see |
| 304 |
</P> |
</P> |
| 305 |
<br><a name="SEC16" href="#TOC1">ERROR NAMES</a><br> |
<br><a name="SEC16" href="#TOC1">ERROR NAMES</a><br> |
| 306 |
<P> |
<P> |
| 307 |
The errors PCRE_ERROR_BADUTF16_OFFSET and PCRE_ERROR_SHORTUTF16 correspond to |
The errors PCRE_ERROR_BADUTF16_OFFSET and PCRE_ERROR_SHORTUTF16 correspond to |
| 308 |
their 8-bit counterparts. The error PCRE_ERROR_BADMODE is given when a compiled |
their 8-bit counterparts. The error PCRE_ERROR_BADMODE is given when a compiled |
| 309 |
pattern is passed to a function that processes patterns in the other |
pattern is passed to a function that processes patterns in the other |
| 310 |
mode, for example, if a pattern compiled with <b>pcre_compile()</b> is passed to |
mode, for example, if a pattern compiled with <b>pcre_compile()</b> is passed to |
| 311 |
<b>pcre16_exec()</b>. |
<b>pcre16_exec()</b>. |
| 312 |
</P> |
</P> |
| 313 |
<P> |
<P> |
| 314 |
There are new error codes whose names begin with PCRE_UTF16_ERR for invalid |
There are new error codes whose names begin with PCRE_UTF16_ERR for invalid |
| 315 |
UTF-16 strings, corresponding to the PCRE_UTF8_ERR codes for UTF-8 strings that |
UTF-16 strings, corresponding to the PCRE_UTF8_ERR codes for UTF-8 strings that |
| 316 |
are described in the section entitled |
are described in the section entitled |
| 317 |
<a href="pcreapi.html#badutf8reasons">"Reason codes for invalid UTF-8 strings"</a> |
<a href="pcreapi.html#badutf8reasons">"Reason codes for invalid UTF-8 strings"</a> |
| 318 |
in the main |
in the main |
| 319 |
<a href="pcreapi.html"><b>pcreapi</b></a> |
<a href="pcreapi.html"><b>pcreapi</b></a> |
| 320 |
page. The UTF-16 errors are: |
page. The UTF-16 errors are: |
| 321 |
<pre> |
<pre> |
| 327 |
</P> |
</P> |
| 328 |
<br><a name="SEC17" href="#TOC1">ERROR TEXTS</a><br> |
<br><a name="SEC17" href="#TOC1">ERROR TEXTS</a><br> |
| 329 |
<P> |
<P> |
| 330 |
If there is an error while compiling a pattern, the error text that is passed |
If there is an error while compiling a pattern, the error text that is passed |
| 331 |
back by <b>pcre16_compile()</b> or <b>pcre16_compile2()</b> is still an 8-bit |
back by <b>pcre16_compile()</b> or <b>pcre16_compile2()</b> is still an 8-bit |
| 332 |
character string, zero-terminated. |
character string, zero-terminated. |
| 333 |
</P> |
</P> |
| 334 |
<br><a name="SEC18" href="#TOC1">CALLOUTS</a><br> |
<br><a name="SEC18" href="#TOC1">CALLOUTS</a><br> |
| 338 |
</P> |
</P> |
| 339 |
<br><a name="SEC19" href="#TOC1">TESTING</a><br> |
<br><a name="SEC19" href="#TOC1">TESTING</a><br> |
| 340 |
<P> |
<P> |
| 341 |
The <b>pcretest</b> program continues to operate with 8-bit input and output |
The <b>pcretest</b> program continues to operate with 8-bit input and output |
| 342 |
files, but it can be used for testing the 16-bit library. If it is run with the |
files, but it can be used for testing the 16-bit library. If it is run with the |
| 343 |
command line option <b>-16</b>, patterns and subject strings are converted from |
command line option <b>-16</b>, patterns and subject strings are converted from |
| 344 |
8-bit to 16-bit before being passed to PCRE, and the 16-bit library functions |
8-bit to 16-bit before being passed to PCRE, and the 16-bit library functions |
| 345 |
are used instead of the 8-bit ones. Returned 16-bit strings are converted to |
are used instead of the 8-bit ones. Returned 16-bit strings are converted to |
| 346 |
8-bit for output. If the 8-bit library was not compiled, <b>pcretest</b> |
8-bit for output. If the 8-bit library was not compiled, <b>pcretest</b> |
| 347 |
defaults to 16-bit and the <b>-16</b> option is ignored. |
defaults to 16-bit and the <b>-16</b> option is ignored. |
| 348 |
</P> |
</P> |
| 349 |
<P> |
<P> |
| 350 |
When PCRE is being built, the <b>RunTest</b> script that is called by "make |
When PCRE is being built, the <b>RunTest</b> script that is called by "make |
| 351 |
check" uses the <b>pcretest</b> <b>-C</b> option to discover which of the 8-bit |
check" uses the <b>pcretest</b> <b>-C</b> option to discover which of the 8-bit |
| 352 |
and 16-bit libraries has been built, and runs the tests appropriately. |
and 16-bit libraries has been built, and runs the tests appropriately. |
| 353 |
</P> |
</P> |
| 354 |
<br><a name="SEC20" href="#TOC1">NOT SUPPORTED IN 16-BIT MODE</a><br> |
<br><a name="SEC20" href="#TOC1">NOT SUPPORTED IN 16-BIT MODE</a><br> |
| 355 |
<P> |
<P> |
| 356 |
Not all the features of the 8-bit library are available with the 16-bit |
Not all the features of the 8-bit library are available with the 16-bit |
| 357 |
library. The C++ and POSIX wrapper functions support only the 8-bit library, |
library. The C++ and POSIX wrapper functions support only the 8-bit library, |
| 358 |
and the <b>pcregrep</b> program is at present 8-bit only. |
and the <b>pcregrep</b> program is at present 8-bit only. |
| 359 |
</P> |
</P> |
| 360 |
<br><a name="SEC21" href="#TOC1">AUTHOR</a><br> |
<br><a name="SEC21" href="#TOC1">AUTHOR</a><br> |