Diff of /code/trunk/doc/pcre.3

revision 208 by ph10, Mon Aug 6 15:23:29 2007 UTC revision 209 by ph10, Tue Aug 7 09:22:06 2007 UTC
@@ -186,15 +186,30 @@ compatibility with Perl 5.6. PCRE does n
 The following comments apply when PCRE is running in UTF-8 mode:
 .P
 1. When you set the PCRE_UTF8 flag, the strings passed as patterns and subjects
-are checked for validity on entry to the relevant functions. If an invalid
-UTF-8 string is passed, an error return is given. In some situations, you may
-already know that your strings are valid, and therefore want to skip these
-checks in order to improve performance. If you set the PCRE_NO_UTF8_CHECK flag
-at compile time or at run time, PCRE assumes that the pattern or subject it
-is given (respectively) contains only valid UTF-8 codes. In this case, it does
-not diagnose an invalid UTF-8 string. If you pass an invalid UTF-8 string to
-PCRE when PCRE_NO_UTF8_CHECK is set, the results are undefined. Your program
-may crash.
+are checked for validity on entry to the relevant functions. Note that the
+check is for a syntactically valid UTF-8 byte string, as defined by RFC 2279.
+It is \fInot\fP a check for a UTF-8 string of assigned or allowable Unicode
+code points. For example, the byte sequence \exED\exB2\ex94 is a valid UTF-8
+encoding of the code point U+DC94, and is not rejected by PCRE. However, that
+code point is in the "Low Surrogate Area" of Unicode, of which the Unicode
+Standard says this: "The Low Surrogate Area does not contain any character
+assignments, consequently no character code charts or namelists are provided
+for this area. Surrogates are reserved for use with UTF-16 and then must be
+used in pairs."
+.P
+The reason for the UTF-8 check at the start is so that the rest of PCRE can
+assume that UTF-8 strings are well formed. There is no intention of
+interpreting the values of the code points, which would involve more processing
+and affect performance.
+.P
+If a syntactically invalid UTF-8 string is passed, an error return is given. In
+some situations, you may already know that your strings are valid, and
+therefore want to skip these checks in order to improve performance. If you set
+the PCRE_NO_UTF8_CHECK flag at compile time or at run time, PCRE assumes that
+the pattern or subject it is given (respectively) contains only valid UTF-8
+codes. In this case, it does not diagnose an invalid UTF-8 string. If you pass
+an invalid UTF-8 string to PCRE when PCRE_NO_UTF8_CHECK is set, the results are
+undefined. Your program may crash.
 .P
 2. An unbraced hexadecimal escape sequence (such as \exb3) matches a two-byte
 UTF-8 character if the value is greater than 127.
@@ -254,6 +269,6 @@ two digits 10, at the domain cam.ac.uk.
 .rs
 .sp
 .nf
-Last updated: 06 August 2007
+Last updated: 07 August 2007
 Copyright (c) 1997-2007 University of Cambridge.
 .fi
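
The text added in revision 209 distinguishes a syntactically valid UTF-8 byte string from a string of assigned or allowable code points. A minimal C sketch of that distinction follows. It is not PCRE's validation code: the names decode_utf8_3byte, is_low_surrogate and check_example are hypothetical, only the three-byte form is handled, and only the byte shape is checked (no overlong-form test is attempted).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of PCRE. Decodes one three-byte UTF-8
   sequence: a lead byte 1110xxxx followed by two continuation bytes
   10xxxxxx. Returns the code point, or -1 if the bytes do not have
   that shape. The code point value itself is not interpreted. */
long decode_utf8_3byte(const unsigned char *s, size_t len)
{
    if (len < 3) return -1;
    if ((s[0] & 0xF0) != 0xE0) return -1;   /* lead byte must be 1110xxxx */
    if ((s[1] & 0xC0) != 0x80) return -1;   /* continuation must be 10xxxxxx */
    if ((s[2] & 0xC0) != 0x80) return -1;
    return ((long)(s[0] & 0x0F) << 12) |
           ((long)(s[1] & 0x3F) << 6)  |
            (long)(s[2] & 0x3F);
}

/* Hypothetical helper: true if cp lies in the Low Surrogate Area,
   U+DC00 to U+DFFF. */
int is_low_surrogate(long cp)
{
    return cp >= 0xDC00 && cp <= 0xDFFF;
}

/* Works through the byte sequence quoted in the man page text. */
void check_example(void)
{
    const unsigned char seq[] = { 0xED, 0xB2, 0x94 };
    long cp = decode_utf8_3byte(seq, sizeof seq);

    assert(cp == 0xDC94);           /* well-formed bytes, decodes cleanly */
    assert(is_low_surrogate(cp));   /* yet the code point is a surrogate */
}
```

The sequence passes the shape test but decodes to a low surrogate. A check of the kind the new paragraph describes would stop after the first step, so a caller that needs to reject surrogates or unassigned code points would have to apply a second test of its own, such as is_low_surrogate here.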

