/[pcre]/code/trunk/doc/html/pcre16.html

Diff of /code/trunk/doc/html/pcre16.html


revision 869 by ph10, Sat Jan 14 11:16:23 2012 UTC revision 903 by ph10, Sat Jan 21 16:37:17 2012 UTC
# Line 160  man page, in case the conversion went wr Line 160  man page, in case the conversion went wr
160  <br><a name="SEC5" href="#TOC1">PCRE 16-BIT API 16-BIT-ONLY FUNCTION</a><br>  <br><a name="SEC5" href="#TOC1">PCRE 16-BIT API 16-BIT-ONLY FUNCTION</a><br>
161  <P>  <P>
162  <b>int pcre16_utf16_to_host_byte_order(PCRE_UCHAR16 *<i>output</i>,</b>  <b>int pcre16_utf16_to_host_byte_order(PCRE_UCHAR16 *<i>output</i>,</b>
163  <b>PCRE_SPTR16 <i>input</i>, int <i>length</i>, int *<i>byte_order</i>, </b>  <b>PCRE_SPTR16 <i>input</i>, int <i>length</i>, int *<i>byte_order</i>,</b>
164  <b>int <i>keep_boms</i>);</b>  <b>int <i>keep_boms</i>);</b>
165  </P>  </P>
166  <br><a name="SEC6" href="#TOC1">THE PCRE 16-BIT LIBRARY</a><br>  <br><a name="SEC6" href="#TOC1">THE PCRE 16-BIT LIBRARY</a><br>
# Line 177  to the 16-bit library. This page describ Line 177  to the 16-bit library. This page describ
177  16-bit library.  16-bit library.
178  </P>  </P>
179  <P>  <P>
180  WARNING: A single application can be linked with both libraries, but you must  WARNING: A single application can be linked with both libraries, but you must
181  take care when processing any particular pattern to use functions from just one  take care when processing any particular pattern to use functions from just one
182  library. For example, if you want to study a pattern that was compiled with  library. For example, if you want to study a pattern that was compiled with
183  <b>pcre16_compile()</b>, you must do so with <b>pcre16_study()</b>, not  <b>pcre16_compile()</b>, you must do so with <b>pcre16_study()</b>, not
184  <b>pcre_study()</b>, and you must free the study data with  <b>pcre_study()</b>, and you must free the study data with
# Line 186  library. For example, if you want to stu Line 186  library. For example, if you want to stu
186  </P>  </P>
187  <br><a name="SEC7" href="#TOC1">THE HEADER FILE</a><br>  <br><a name="SEC7" href="#TOC1">THE HEADER FILE</a><br>
188  <P>  <P>
189  There is only one header file, <b>pcre.h</b>. It contains prototypes for all the  There is only one header file, <b>pcre.h</b>. It contains prototypes for all the
190  functions in both libraries, as well as definitions of flags, structures, error  functions in both libraries, as well as definitions of flags, structures, error
191  codes, etc.  codes, etc.
192  </P>  </P>
193  <br><a name="SEC8" href="#TOC1">THE LIBRARY NAME</a><br>  <br><a name="SEC8" href="#TOC1">THE LIBRARY NAME</a><br>
194  <P>  <P>
195  In Unix-like systems, the 16-bit library is called <b>libpcre16</b>, and can  In Unix-like systems, the 16-bit library is called <b>libpcre16</b>, and can
196  normally be accessed by adding <b>-lpcre16</b> to the command for linking an  normally be accessed by adding <b>-lpcre16</b> to the command for linking an
197  application that uses PCRE.  application that uses PCRE.
198  </P>  </P>
199  <br><a name="SEC9" href="#TOC1">STRING TYPES</a><br>  <br><a name="SEC9" href="#TOC1">STRING TYPES</a><br>
200  <P>  <P>
201  In the 8-bit library, strings are passed to PCRE library functions as vectors  In the 8-bit library, strings are passed to PCRE library functions as vectors
202  of bytes with the C type "char *". In the 16-bit library, strings are passed as  of bytes with the C type "char *". In the 16-bit library, strings are passed as
203  vectors of unsigned 16-bit quantities. The macro PCRE_UCHAR16 specifies an  vectors of unsigned 16-bit quantities. The macro PCRE_UCHAR16 specifies an
204  appropriate data type, and PCRE_SPTR16 is defined as "const PCRE_UCHAR16 *". In  appropriate data type, and PCRE_SPTR16 is defined as "const PCRE_UCHAR16 *". In
205  very many environments, "short int" is a 16-bit data type. When PCRE is built,  very many environments, "short int" is a 16-bit data type. When PCRE is built,
206  it defines PCRE_UCHAR16 as "short int", but checks that it really is a 16-bit  it defines PCRE_UCHAR16 as "short int", but checks that it really is a 16-bit
207  data type. If it is not, the build fails with an error message telling the  data type. If it is not, the build fails with an error message telling the
208  maintainer to modify the definition appropriately.  maintainer to modify the definition appropriately.
209  </P>  </P>
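The width check described in the STRING TYPES paragraph can be sketched in C as follows. This is only an illustration: the name sketch_uchar16 is hypothetical, the choice of "unsigned short" is an assumption, and the real PCRE_UCHAR16 definition and build-time check live in PCRE's own sources.

```c
#include <limits.h>

/* Hypothetical stand-in for PCRE_UCHAR16 (assumed here to be unsigned short);
   the real definition is in pcre.h. */
typedef unsigned short sketch_uchar16;

/* C11 compile-time check: the build stops if the type is not 16 bits wide. */
_Static_assert(sizeof(sketch_uchar16) * CHAR_BIT == 16,
               "sketch_uchar16 must be exactly 16 bits wide");

/* Run-time view of the same property, useful in a test program. */
static int sketch_uchar16_bits(void)
{
    return (int)(sizeof(sketch_uchar16) * CHAR_BIT);
}
```

On a platform where "unsigned short" is not 16 bits wide, the static assertion fails at compile time, which mirrors the documented behaviour of failing the build rather than miscompiling.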
210  <br><a name="SEC10" href="#TOC1">STRUCTURE TYPES</a><br>  <br><a name="SEC10" href="#TOC1">STRUCTURE TYPES</a><br>
211  <P>  <P>
212  The types of the opaque structures that are used for compiled 16-bit patterns  The types of the opaque structures that are used for compiled 16-bit patterns
213  and JIT stacks are <b>pcre16</b> and <b>pcre16_jit_stack</b> respectively. The  and JIT stacks are <b>pcre16</b> and <b>pcre16_jit_stack</b> respectively. The
214  type of the user-accessible structure that is returned by <b>pcre16_study()</b>  type of the user-accessible structure that is returned by <b>pcre16_study()</b>
215  is <b>pcre16_extra</b>, and the type of the structure that is used for passing  is <b>pcre16_extra</b>, and the type of the structure that is used for passing
216  data to a callout function is <b>pcre16_callout_block</b>. These structures  data to a callout function is <b>pcre16_callout_block</b>. These structures
217  contain the same fields, with the same names, as their 8-bit counterparts. The  contain the same fields, with the same names, as their 8-bit counterparts. The
218  only difference is that pointers to character strings are 16-bit instead of  only difference is that pointers to character strings are 16-bit instead of
219  8-bit types.  8-bit types.
220  </P>  </P>
221  <br><a name="SEC11" href="#TOC1">16-BIT FUNCTIONS</a><br>  <br><a name="SEC11" href="#TOC1">16-BIT FUNCTIONS</a><br>
222  <P>  <P>
223  For every function in the 8-bit library there is a corresponding function in  For every function in the 8-bit library there is a corresponding function in
224  the 16-bit library with a name that starts with <b>pcre16_</b> instead of  the 16-bit library with a name that starts with <b>pcre16_</b> instead of
225  <b>pcre_</b>. The prototypes are listed above. In addition, there is one extra  <b>pcre_</b>. The prototypes are listed above. In addition, there is one extra
226  function, <b>pcre16_utf16_to_host_byte_order()</b>. This is a utility function  function, <b>pcre16_utf16_to_host_byte_order()</b>. This is a utility function
227  that converts a UTF-16 character string to host byte order if necessary. The  that converts a UTF-16 character string to host byte order if necessary. The
228  other 16-bit functions expect the strings they are passed to be in host byte  other 16-bit functions expect the strings they are passed to be in host byte
229  order.  order.
230  </P>  </P>
231  <P>  <P>
232  The <i>input</i> and <i>output</i> arguments of  The <i>input</i> and <i>output</i> arguments of
233  <b>pcre16_utf16_to_host_byte_order()</b> may point to the same address, that is,  <b>pcre16_utf16_to_host_byte_order()</b> may point to the same address, that is,
234  conversion in place is supported. The output buffer must be at least as long as  conversion in place is supported. The output buffer must be at least as long as
235  the input.  the input.
236  </P>  </P>
237  <P>  <P>
# Line 239  The length argument specifies the Line 239  The length argument specifies the
239  input string; a negative value specifies a zero-terminated string.  input string; a negative value specifies a zero-terminated string.
240  </P>  </P>
241  <P>  <P>
242  If <i>byte_order</i> is NULL, it is assumed that the string starts off in host  If <i>byte_order</i> is NULL, it is assumed that the string starts off in host
243  byte order. This may be changed by byte-order marks (BOMs) anywhere in the  byte order. This may be changed by byte-order marks (BOMs) anywhere in the
244  string (commonly as the first character).  string (commonly as the first character).
245  </P>  </P>
246  <P>  <P>
247  If <i>byte_order</i> is not NULL, a non-zero value of the integer to which it  If <i>byte_order</i> is not NULL, a non-zero value of the integer to which it
248  points means that the input starts off in host byte order, otherwise the  points means that the input starts off in host byte order, otherwise the
249  opposite order is assumed. Again, BOMs in the string can change this. The final  opposite order is assumed. Again, BOMs in the string can change this. The final
250  byte order is passed back at the end of processing.  byte order is passed back at the end of processing.
251  </P>  </P>
252  <P>  <P>
253  If <i>keep_boms</i> is not zero, byte-order mark characters (0xfeff) are copied  If <i>keep_boms</i> is not zero, byte-order mark characters (0xfeff) are copied
254  into the output string. Otherwise they are discarded.  into the output string. Otherwise they are discarded.
255  </P>  </P>
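The byte-order rules described above (initial order taken from byte_order, BOMs switching the order anywhere in the string, optional BOM removal) can be sketched as a standalone C function. This is a minimal illustration under stated assumptions, not PCRE's implementation: the names swap16 and sketch_to_host_order are hypothetical, and the return value (number of 16-bit units written, terminator included for zero-terminated input) is an assumption.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: exchange the two bytes of a 16-bit unit. */
static uint16_t swap16(uint16_t u)
{
    return (uint16_t)((u << 8) | (u >> 8));
}

/* Sketch of the documented behaviour. Returns the number of 16-bit units
   written to output (an assumption for this illustration). */
static int sketch_to_host_order(uint16_t *output, const uint16_t *input,
                                int length, int *byte_order, int keep_boms)
{
    /* NULL byte_order: assume the string starts off in host byte order. */
    int host = (byte_order != NULL) ? (*byte_order != 0) : 1;
    int written = 0;

    /* A negative length means a zero-terminated string; count the
       terminator too (zero is unchanged by byte swapping). */
    if (length < 0) {
        length = 0;
        while (input[length] != 0) length++;
        length++;
    }

    for (int i = 0; i < length; i++) {
        uint16_t u = input[i];
        if (u == 0xfeff || u == 0xfffe) {
            /* A BOM read as 0xfeff means host order from here on;
               0xfffe means the opposite order. */
            host = (u == 0xfeff);
            if (keep_boms) output[written++] = 0xfeff;
            continue;
        }
        output[written++] = host ? u : swap16(u);
    }

    /* Pass the final byte order back to the caller. */
    if (byte_order != NULL) *byte_order = host;
    return written;
}
```

Because the write position never overtakes the read position, the sketch is safe when input and output point to the same address, matching the in-place conversion mentioned earlier.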
256  <P>  <P>
# Line 259  buffer, including the zero terminator if Line 259  buffer, including the zero terminator if
259  </P>  </P>
260  <br><a name="SEC12" href="#TOC1">SUBJECT STRING OFFSETS</a><br>  <br><a name="SEC12" href="#TOC1">SUBJECT STRING OFFSETS</a><br>
261  <P>  <P>
262  The offsets within subject strings that are returned by the matching functions  The offsets within subject strings that are returned by the matching functions
263  are in 16-bit units rather than bytes.  are in 16-bit units rather than bytes.
264  </P>  </P>
265  <br><a name="SEC13" href="#TOC1">NAMED SUBPATTERNS</a><br>  <br><a name="SEC13" href="#TOC1">NAMED SUBPATTERNS</a><br>
266  <P>  <P>
267  The name-to-number translation table that is maintained for named subpatterns  The name-to-number translation table that is maintained for named subpatterns
268  uses 16-bit characters. The <b>pcre16_get_stringtable_entries()</b> function  uses 16-bit characters. The <b>pcre16_get_stringtable_entries()</b> function
269  returns the length of each entry in the table as the number of 16-bit data  returns the length of each entry in the table as the number of 16-bit data
270  units.  units.
271  </P>  </P>
272  <br><a name="SEC14" href="#TOC1">OPTION NAMES</a><br>  <br><a name="SEC14" href="#TOC1">OPTION NAMES</a><br>
# Line 276  which correspond to PCRE_UTF8 and PCRE_N Line 276  which correspond to PCRE_UTF8 and PCRE_N
276  fact, these new options define the same bits in the options word.  fact, these new options define the same bits in the options word.
277  </P>  </P>
278  <P>  <P>
279  For the <b>pcre16_config()</b> function there is an option PCRE_CONFIG_UTF16  For the <b>pcre16_config()</b> function there is an option PCRE_CONFIG_UTF16
280  that returns 1 if UTF-16 support is configured, otherwise 0. If this option is  that returns 1 if UTF-16 support is configured, otherwise 0. If this option is
281  given to <b>pcre_config()</b>, or if the PCRE_CONFIG_UTF8 option is given to  given to <b>pcre_config()</b>, or if the PCRE_CONFIG_UTF8 option is given to
282  <b>pcre16_config()</b>, the result is the PCRE_ERROR_BADOPTION error.  <b>pcre16_config()</b>, the result is the PCRE_ERROR_BADOPTION error.
283  </P>  </P>
284  <br><a name="SEC15" href="#TOC1">CHARACTER CODES</a><br>  <br><a name="SEC15" href="#TOC1">CHARACTER CODES</a><br>
285  <P>  <P>
286  In 16-bit mode, when PCRE_UTF16 is not set, character values are treated in the  In 16-bit mode, when PCRE_UTF16 is not set, character values are treated in the
287  same way as in 8-bit, non UTF-8 mode, except, of course, that they can range  same way as in 8-bit, non UTF-8 mode, except, of course, that they can range
288  from 0 to 0xffff instead of 0 to 0xff. Character types for characters less than  from 0 to 0xffff instead of 0 to 0xff. Character types for characters less than
289  0xff can therefore be influenced by the locale in the same way as before.  0xff can therefore be influenced by the locale in the same way as before.
290  Characters greater than 0xff have only one case, and no "type" (such as letter  Characters greater than 0xff have only one case, and no "type" (such as letter
291  or digit).  or digit).
292  </P>  </P>
293  <P>  <P>
294  In UTF-16 mode, the character code is Unicode, in the range 0 to 0x10ffff, with  In UTF-16 mode, the character code is Unicode, in the range 0 to 0x10ffff, with
295  the exception of values in the range 0xd800 to 0xdfff because those are  the exception of values in the range 0xd800 to 0xdfff because those are
296  "surrogate" values that are used in pairs to encode values greater than 0xffff.  "surrogate" values that are used in pairs to encode values greater than 0xffff.
297  </P>  </P>
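The surrogate mechanism mentioned above follows the standard UTF-16 rule: a high surrogate (0xd800 to 0xdbff) followed by a low surrogate (0xdc00 to 0xdfff) encodes one code point above 0xffff. A minimal C sketch is shown here; the function names are hypothetical and the code is not taken from PCRE. It assumes the string is in host byte order and zero-terminated, so that reading the unit after a high surrogate is always valid.

```c
#include <stdint.h>

static int is_high_surrogate(uint16_t u) { return u >= 0xd800 && u <= 0xdbff; }
static int is_low_surrogate(uint16_t u)  { return u >= 0xdc00 && u <= 0xdfff; }

/* Decode one code point starting at s. On success, store the number of
   16-bit units consumed in *units_used and return the code point.
   Return -1 for an unpaired or misplaced surrogate. */
static int32_t sketch_decode_utf16(const uint16_t *s, int *units_used)
{
    if (is_high_surrogate(s[0])) {
        if (!is_low_surrogate(s[1])) return -1;   /* missing low surrogate */
        *units_used = 2;
        return 0x10000 + (((int32_t)s[0] - 0xd800) << 10)
                       + ((int32_t)s[1] - 0xdc00);
    }
    if (is_low_surrogate(s[0])) return -1;        /* isolated low surrogate */
    *units_used = 1;
    return s[0];
}
```

For example, the pair 0xd83d 0xde00 decodes to 0x1f600, consuming two 16-bit units, which is why offsets in UTF-16 mode count units rather than characters.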
298  <P>  <P>
299  A UTF-16 string can indicate its endianness by a special code known as a  A UTF-16 string can indicate its endianness by a special code known as a
300  byte-order mark (BOM). The PCRE functions do not handle this, expecting strings  byte-order mark (BOM). The PCRE functions do not handle this, expecting strings
301  to be in host byte order. A utility function called  to be in host byte order. A utility function called
302  <b>pcre16_utf16_to_host_byte_order()</b> is provided to help with this (see  <b>pcre16_utf16_to_host_byte_order()</b> is provided to help with this (see
# Line 304  above). Line 304  above).
304  </P>  </P>
305  <br><a name="SEC16" href="#TOC1">ERROR NAMES</a><br>  <br><a name="SEC16" href="#TOC1">ERROR NAMES</a><br>
306  <P>  <P>
307  The errors PCRE_ERROR_BADUTF16_OFFSET and PCRE_ERROR_SHORTUTF16 correspond to  The errors PCRE_ERROR_BADUTF16_OFFSET and PCRE_ERROR_SHORTUTF16 correspond to
308  their 8-bit counterparts. The error PCRE_ERROR_BADMODE is given when a compiled  their 8-bit counterparts. The error PCRE_ERROR_BADMODE is given when a compiled
309  pattern is passed to a function that processes patterns in the other  pattern is passed to a function that processes patterns in the other
310  mode, for example, if a pattern compiled with <b>pcre_compile()</b> is passed to  mode, for example, if a pattern compiled with <b>pcre_compile()</b> is passed to
311  <b>pcre16_exec()</b>.  <b>pcre16_exec()</b>.
312  </P>  </P>
313  <P>  <P>
314  There are new error codes whose names begin with PCRE_UTF16_ERR for invalid  There are new error codes whose names begin with PCRE_UTF16_ERR for invalid
315  UTF-16 strings, corresponding to the PCRE_UTF8_ERR codes for UTF-8 strings that  UTF-16 strings, corresponding to the PCRE_UTF8_ERR codes for UTF-8 strings that
316  are described in the section entitled  are described in the section entitled
317  <a href="pcreapi.html#badutf8reasons">"Reason codes for invalid UTF-8 strings"</a>  <a href="pcreapi.html#badutf8reasons">"Reason codes for invalid UTF-8 strings"</a>
318  in the main  in the main
319  <a href="pcreapi.html"><b>pcreapi</b></a>  <a href="pcreapi.html"><b>pcreapi</b></a>
320  page. The UTF-16 errors are:  page. The UTF-16 errors are:
321  <pre>  <pre>
# Line 327  page. The UTF-16 errors are: Line 327  page. The UTF-16 errors are:
327  </P>  </P>
328  <br><a name="SEC17" href="#TOC1">ERROR TEXTS</a><br>  <br><a name="SEC17" href="#TOC1">ERROR TEXTS</a><br>
329  <P>  <P>
330  If there is an error while compiling a pattern, the error text that is passed  If there is an error while compiling a pattern, the error text that is passed
331  back by <b>pcre16_compile()</b> or <b>pcre16_compile2()</b> is still an 8-bit  back by <b>pcre16_compile()</b> or <b>pcre16_compile2()</b> is still an 8-bit
332  character string, zero-terminated.  character string, zero-terminated.
333  </P>  </P>
334  <br><a name="SEC18" href="#TOC1">CALLOUTS</a><br>  <br><a name="SEC18" href="#TOC1">CALLOUTS</a><br>
# Line 338  a callout function point to 16-bit vecto Line 338  a callout function point to 16-bit vecto
338  </P>  </P>
339  <br><a name="SEC19" href="#TOC1">TESTING</a><br>  <br><a name="SEC19" href="#TOC1">TESTING</a><br>
340  <P>  <P>
341  The <b>pcretest</b> program continues to operate with 8-bit input and output  The <b>pcretest</b> program continues to operate with 8-bit input and output
342  files, but it can be used for testing the 16-bit library. If it is run with the  files, but it can be used for testing the 16-bit library. If it is run with the
343  command line option <b>-16</b>, patterns and subject strings are converted from  command line option <b>-16</b>, patterns and subject strings are converted from
344  8-bit to 16-bit before being passed to PCRE, and the 16-bit library functions  8-bit to 16-bit before being passed to PCRE, and the 16-bit library functions
345  are used instead of the 8-bit ones. Returned 16-bit strings are converted to  are used instead of the 8-bit ones. Returned 16-bit strings are converted to
346  8-bit for output. If the 8-bit library was not compiled, <b>pcretest</b>  8-bit for output. If the 8-bit library was not compiled, <b>pcretest</b>
347  defaults to 16-bit and the <b>-16</b> option is ignored.  defaults to 16-bit and the <b>-16</b> option is ignored.
348  </P>  </P>
349  <P>  <P>
350  When PCRE is being built, the <b>RunTest</b> script that is called by "make  When PCRE is being built, the <b>RunTest</b> script that is called by "make
351  check" uses the <b>pcretest</b> <b>-C</b> option to discover which of the 8-bit  check" uses the <b>pcretest</b> <b>-C</b> option to discover which of the 8-bit
352  and 16-bit libraries has been built, and runs the tests appropriately.  and 16-bit libraries has been built, and runs the tests appropriately.
353  </P>  </P>
354  <br><a name="SEC20" href="#TOC1">NOT SUPPORTED IN 16-BIT MODE</a><br>  <br><a name="SEC20" href="#TOC1">NOT SUPPORTED IN 16-BIT MODE</a><br>
355  <P>  <P>
356  Not all the features of the 8-bit library are available with the 16-bit  Not all the features of the 8-bit library are available with the 16-bit
357  library. The C++ and POSIX wrapper functions support only the 8-bit library,  library. The C++ and POSIX wrapper functions support only the 8-bit library,
358  and the <b>pcregrep</b> program is at present 8-bit only.  and the <b>pcregrep</b> program is at present 8-bit only.
359  </P>  </P>
360  <br><a name="SEC21" href="#TOC1">AUTHOR</a><br>  <br><a name="SEC21" href="#TOC1">AUTHOR</a><br>
