Diff of /code/trunk/doc/pcre.3

revision 47 by nigel, Sat Feb 24 21:39:29 2007 UTC | revision 49 by nigel, Sat Feb 24 21:39:33 2007 UTC
# Line 44  pcre - Perl-compatible regular expressio Line 44  pcre - Perl-compatible regular expressio
44  .B int *\fIovector\fR, int \fIstringcount\fR, "const char ***\fIlistptr\fR);"  .B int *\fIovector\fR, int \fIstringcount\fR, "const char ***\fIlistptr\fR);"
45  .PP  .PP
46  .br  .br
47    .B void pcre_free_substring(const char *\fIstringptr\fR);
48    .PP
49    .br
50    .B void pcre_free_substring_list(const char **\fIstringptr\fR);
51    .PP
52    .br
53  .B const unsigned char *pcre_maketables(void);  .B const unsigned char *pcre_maketables(void);
54  .PP  .PP
55  .br  .br
# Line 70  pcre - Perl-compatible regular expressio Line 76  pcre - Perl-compatible regular expressio
76  The PCRE library is a set of functions that implement regular expression  The PCRE library is a set of functions that implement regular expression
77  pattern matching using the same syntax and semantics as Perl 5, with just a few  pattern matching using the same syntax and semantics as Perl 5, with just a few
78  differences (see below). The current implementation corresponds to Perl 5.005,  differences (see below). The current implementation corresponds to Perl 5.005,
79  with some additional features from the Perl development release.  with some additional features from later versions. This includes some
80    experimental, incomplete support for UTF-8 encoded strings. Details of exactly
81    what is and what is not supported are given below.
82    
83  PCRE has its own native API, which is described in this document. There is also  PCRE has its own native API, which is described in this document. There is also
84  a set of wrapper functions that correspond to the POSIX regular expression API.  a set of wrapper functions that correspond to the POSIX regular expression API.
# Line 84  contain the major and minor release numb Line 92  contain the major and minor release numb
92  use these to include support for different releases.  use these to include support for different releases.
93    
94  The functions \fBpcre_compile()\fR, \fBpcre_study()\fR, and \fBpcre_exec()\fR  The functions \fBpcre_compile()\fR, \fBpcre_study()\fR, and \fBpcre_exec()\fR
95  are used for compiling and matching regular expressions, while  are used for compiling and matching regular expressions.
96  \fBpcre_copy_substring()\fR, \fBpcre_get_substring()\fR, and  
97    The functions \fBpcre_copy_substring()\fR, \fBpcre_get_substring()\fR, and
98  \fBpcre_get_substring_list()\fR are convenience functions for extracting  \fBpcre_get_substring_list()\fR are convenience functions for extracting
99  captured substrings from a matched subject string. The function  captured substrings from a matched subject string; \fBpcre_free_substring()\fR
100  \fBpcre_maketables()\fR is used (optionally) to build a set of character tables  and \fBpcre_free_substring_list()\fR are also provided, to free the memory used
101  in the current locale for passing to \fBpcre_compile()\fR.  for extracted strings.
102    
103    The function \fBpcre_maketables()\fR is used (optionally) to build a set of
104    character tables in the current locale for passing to \fBpcre_compile()\fR.
105    
106  The function \fBpcre_fullinfo()\fR is used to find out information about a  The function \fBpcre_fullinfo()\fR is used to find out information about a
107  compiled pattern; \fBpcre_info()\fR is an obsolete version which returns only  compiled pattern; \fBpcre_info()\fR is an obsolete version which returns only
# Line 223  This option inverts the "greediness" of Line 235  This option inverts the "greediness" of
235  greedy by default, but become greedy if followed by "?". It is not compatible  greedy by default, but become greedy if followed by "?". It is not compatible
236  with Perl. It can also be set by a (?U) option setting within the pattern.  with Perl. It can also be set by a (?U) option setting within the pattern.
237    
238      PCRE_UTF8
239    
240    This option causes PCRE to regard both the pattern and the subject as strings
241    of UTF-8 characters instead of just byte strings. However, it is available only
242    if PCRE has been built to include UTF-8 support. If not, the use of this option
243    provokes an error. Support for UTF-8 is new, experimental, and incomplete.
244    Details of exactly what it entails are given below.
245    
246    
247  .SH STUDYING A PATTERN  .SH STUDYING A PATTERN
248  When a pattern is going to be used several times, it is worth spending more  When a pattern is going to be used several times, it is worth spending more
# Line 558  extract a single substring, whose number Line 578  extract a single substring, whose number
578  value of zero extracts the substring that matched the entire pattern, while  value of zero extracts the substring that matched the entire pattern, while
579  higher values extract the captured substrings. For \fBpcre_copy_substring()\fR,  higher values extract the captured substrings. For \fBpcre_copy_substring()\fR,
580  the string is placed in \fIbuffer\fR, whose length is given by  the string is placed in \fIbuffer\fR, whose length is given by
581  \fIbuffersize\fR, while for \fBpcre_get_substring()\fR a new block of store is  \fIbuffersize\fR, while for \fBpcre_get_substring()\fR a new block of memory is
582  obtained via \fBpcre_malloc\fR, and its address is returned via  obtained via \fBpcre_malloc\fR, and its address is returned via
583  \fIstringptr\fR. The yield of the function is the length of the string, not  \fIstringptr\fR. The yield of the function is the length of the string, not
584  including the terminating zero, or one of  including the terminating zero, or one of
# Line 590  string. This can be distinguished from a Line 610  string. This can be distinguished from a
610  inspecting the appropriate offset in \fIovector\fR, which is negative for unset  inspecting the appropriate offset in \fIovector\fR, which is negative for unset
611  substrings.  substrings.
612    
613    The two convenience functions \fBpcre_free_substring()\fR and
614    \fBpcre_free_substring_list()\fR can be used to free the memory returned by
615    a previous call of \fBpcre_get_substring()\fR or
616    \fBpcre_get_substring_list()\fR, respectively. They do nothing more than call
617    the function pointed to by \fBpcre_free\fR, which of course could be called
618    directly from a C program. However, PCRE is used in some situations where it is
619    linked via a special interface to another programming language which cannot use
620    \fBpcre_free\fR directly; it is for these cases that the functions are
621    provided.
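
The wrapper arrangement described in the paragraph above can be sketched as follows. This is a minimal illustration only, not PCRE's source: the names demo_malloc, demo_free, counting_free, demo_get_substring, and demo_free_substring are hypothetical stand-ins, and the sketch assumes the allocator hooks are plain global function pointers, as the text says of pcre_malloc and pcre_free.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-ins for the library's replaceable allocator hooks.
   The man page names the real ones pcre_malloc and pcre_free. */
static void *(*demo_malloc)(size_t) = malloc;
static void (*demo_free)(void *) = free;

/* A replacement free hook that counts its calls, to show that the
   wrapper honours whatever function the pointer currently holds. */
static int demo_free_calls = 0;
static void counting_free(void *block)
{
    demo_free_calls++;
    free(block);
}

/* Copy subject[start, end) into newly obtained, zero-terminated memory.
   Returns the length of the copy, or -1 if no memory could be obtained. */
static int demo_get_substring(const char *subject, int start, int end,
                              const char **stringptr)
{
    assert(subject != NULL && stringptr != NULL);
    assert(0 <= start && start <= end);
    size_t length = (size_t)(end - start);
    char *copy = demo_malloc(length + 1);
    if (copy == NULL) return -1;
    memcpy(copy, subject + start, length);
    copy[length] = '\0';
    *stringptr = copy;
    return (int)length;
}

/* Thin wrapper: it does nothing more than call through the free hook,
   so a foreign-language binding needs only an ordinary function call. */
static void demo_free_substring(const char *stringptr)
{
    demo_free((void *)stringptr);
}
```

A caller would obtain a string with demo_get_substring() and later hand the same pointer to demo_free_substring(). If the application has installed its own free hook (counting_free here), the wrapper uses it automatically.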
622    
623    
624  .SH LIMITATIONS  .SH LIMITATIONS
# Line 691  The syntax and semantics of the regular Line 720  The syntax and semantics of the regular
720  described below. Regular expressions are also described in the Perl  described below. Regular expressions are also described in the Perl
721  documentation and in a number of other books, some of which have copious  documentation and in a number of other books, some of which have copious
722  examples. Jeffrey Friedl's "Mastering Regular Expressions", published by  examples. Jeffrey Friedl's "Mastering Regular Expressions", published by
723  O'Reilly (ISBN 1-56592-257), covers them in great detail. The description  O'Reilly (ISBN 1-56592-257), covers them in great detail.
724  here is intended as reference documentation.  
725    The description here is intended as reference documentation. The basic
726    operation of PCRE is on strings of bytes. However, there are the beginnings of
727    some support for UTF-8 character strings. To use this support you must
728    configure PCRE to include it, and then call \fBpcre_compile()\fR with the
729    PCRE_UTF8 option. How this affects the pattern matching is described in the
730    final section of this document.
731    
732  A regular expression is a pattern that is matched against a subject string from  A regular expression is a pattern that is matched against a subject string from
733  left to right. Most characters stand for themselves in a pattern, and match the  left to right. Most characters stand for themselves in a pattern, and match the
# Line 1311  example, the pattern Line 1346  example, the pattern
1346    
1347    (a|b\\1)+    (a|b\\1)+
1348    
1349  matches any number of "a"s and also "aba", "ababaa" etc. At each iteration of  matches any number of "a"s and also "aba", "ababbaa" etc. At each iteration of
1350  the subpattern, the back reference matches the character string corresponding  the subpattern, the back reference matches the character string corresponding
1351  to the previous iteration. In order for this to work, the pattern must be such  to the previous iteration. In order for this to work, the pattern must be such
1352  that the first iteration does not need to match the back reference. This can be  that the first iteration does not need to match the back reference. This can be
# Line 1685  with the pattern above. The former gives Line 1720  with the pattern above. The former gives
1720  applied to a whole line of "a" characters, whereas the latter takes an  applied to a whole line of "a" characters, whereas the latter takes an
1721  appreciable time with strings longer than about 20 characters.  appreciable time with strings longer than about 20 characters.
1722    
1723    
1724    .SH UTF-8 SUPPORT
1725    Starting at release 3.3, PCRE has some support for character strings encoded
1726    in the UTF-8 format. This is incomplete, and is regarded as experimental. In
1727    order to use it, you must configure PCRE to include UTF-8 support in the code,
1728    and, in addition, you must call \fBpcre_compile()\fR with the PCRE_UTF8 option
1729    flag. When you do this, both the pattern and any subject strings that are
1730    matched against it are treated as UTF-8 strings instead of just strings of
1731    bytes, but only in the cases that are mentioned below.
1732    
1733    If you compile PCRE with UTF-8 support, but do not use it at run time, the
1734    library will be a bit bigger, but the additional run time overhead is limited
1735    to testing the PCRE_UTF8 flag in several places, so should not be very large.
1736    
1737    PCRE assumes that the strings it is given contain valid UTF-8 codes. It does
1738    not diagnose invalid UTF-8 strings. If you pass invalid UTF-8 strings to PCRE,
1739    the results are undefined.
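
Because PCRE does not diagnose invalid input, a caller may wish to check strings before passing them in. The following is a minimal sketch of such a check, under stated assumptions: the function names are hypothetical and are not part of the PCRE API, and the test is structural only. It verifies that each lead byte is followed by the right number of continuation bytes in the one-to-six-byte scheme, and does not reject overlong encodings.

```c
#include <assert.h>
#include <stddef.h>

/* Number of bytes in a UTF-8 sequence, judged from its first byte, in the
   one-to-six-byte scheme; 0 if the byte cannot start a sequence. */
static int utf8_sequence_length(unsigned char lead)
{
    if (lead < 0x80) return 1;   /* 0xxxxxxx: single byte */
    if (lead < 0xC0) return 0;   /* 10xxxxxx: continuation byte */
    if (lead < 0xE0) return 2;   /* 110xxxxx */
    if (lead < 0xF0) return 3;   /* 1110xxxx */
    if (lead < 0xF8) return 4;   /* 11110xxx */
    if (lead < 0xFC) return 5;   /* 111110xx */
    if (lead < 0xFE) return 6;   /* 1111110x */
    return 0;                    /* 0xFE and 0xFF never appear */
}

/* Returns 1 if every sequence in s[0..length) is structurally well formed,
   0 otherwise. Overlong forms are NOT detected by this sketch. */
static int utf8_is_well_formed(const unsigned char *s, size_t length)
{
    assert(s != NULL || length == 0);
    size_t i = 0;
    while (i < length) {
        int n = utf8_sequence_length(s[i]);
        if (n == 0 || (size_t)n > length - i) return 0;   /* bad or truncated */
        for (int k = 1; k < n; k++) {
            if ((s[i + k] & 0xC0) != 0x80) return 0;      /* not 10xxxxxx */
        }
        i += (size_t)n;
    }
    return 1;
}
```

A program could call utf8_is_well_formed() on both the pattern and the subject, and use PCRE_UTF8 only when both calls return 1.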
1740    
1741    Running with PCRE_UTF8 set causes these changes in the way PCRE works:
1742    
1743    1. In a pattern, the escape sequence \\x{...}, where the braces contain a
1744    string of hexadecimal digits, is interpreted as a UTF-8 character whose
1745    code number is the given hexadecimal number, for example: \\x{1234}. This
1746    inserts from one to six literal bytes into the pattern, using the UTF-8
1747    encoding. If a non-hexadecimal digit appears between the braces, the item is
1748    not recognized.
1749    
1750    2. The original hexadecimal escape sequence, \\xhh, generates a two-byte UTF-8
1751    character if its value is greater than 127.
1752    
1753    3. Repeat quantifiers are NOT correctly handled if they follow a multibyte
1754    character. For example, \\x{100}* and \\xc3+ do not work. At present, if you
1755    want to repeat such characters, you must enclose them in non-capturing
1756    parentheses, for example (?:\\x{100}).
1757    
1758    4. The dot metacharacter matches one UTF-8 character instead of a single byte.
1759    
1760    5. Unlike literal UTF-8 characters, the dot metacharacter followed by a
1761    repeat quantifier does operate correctly on UTF-8 characters instead of
1762    single bytes.
1763    
1764    6. Although the \\x{...} escape is permitted in a character class, characters
1765    whose values are greater than 255 cannot be included in a class.
1766    
1767    7. A class is matched against a UTF-8 character instead of just a single byte,
1768    but it can match only characters whose values are less than 256. Characters
1769    with greater values always fail to match a class.
1770    
1771    8. Repeated classes work correctly on multiple characters.
1772    
1773    9. Classes containing just a single character whose value is greater than 127
1774    (but less than 256), for example, [\\x80] or [^\\x{93}], do not work because
1775    these are optimized into single byte matches. In the first case, of course,
1776    the class brackets are just redundant.
1777    
1778    10. Lookbehind assertions move backwards in the subject by a fixed number of
1779    characters instead of a fixed number of bytes. Simple cases have been tested
1780    to work correctly, but there may be hidden gotchas herein.
1781    
1782    11. The character types such as \\d and \\w do not work correctly with UTF-8
1783    characters. They continue to test a single byte.
1784    
1785    12. Anything not explicitly mentioned here continues to work in bytes rather
1786    than in characters.
1786    than in characters.
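
Several items in the list above depend on how a character number maps to bytes: the \\x{...} escape inserts from one to six bytes, \\xhh above 127 becomes two bytes, and lookbehind steps backwards by characters rather than bytes. The following is a minimal sketch of that arithmetic. It is not PCRE's own code; the function names utf8_encode and utf8_step_back are hypothetical, and the sketch assumes the one-to-six-byte UTF-8 scheme (values up to 0x7FFFFFFF) mentioned in the list.

```c
#include <assert.h>
#include <stddef.h>

/* Encode character number cp (0 .. 0x7FFFFFFF) as UTF-8 into out, which
   must have room for 6 bytes. Returns the number of bytes written. */
static int utf8_encode(unsigned long cp, unsigned char *out)
{
    /* Largest value that fits in 1, 2, ... 6 bytes. */
    static const unsigned long limit[6] =
        { 0x7FUL, 0x7FFUL, 0xFFFFUL, 0x1FFFFFUL, 0x3FFFFFFUL, 0x7FFFFFFFUL };
    /* Marker bits of the first byte for each sequence length. */
    static const unsigned char lead_bits[6] =
        { 0x00, 0xC0, 0xE0, 0xF0, 0xF8, 0xFC };

    assert(out != NULL && cp <= 0x7FFFFFFFUL);
    int extra = 0;                       /* number of continuation bytes */
    while (cp > limit[extra]) extra++;
    for (int k = extra; k > 0; k--) {    /* fill from the last byte back */
        out[k] = (unsigned char)(0x80 | (cp & 0x3F));
        cp >>= 6;
    }
    out[0] = (unsigned char)(lead_bits[extra] | cp);
    return extra + 1;
}

/* Move back one character from byte offset pos (pos > 0) in s by skipping
   any 10xxxxxx continuation bytes. Assumes s holds valid UTF-8. */
static size_t utf8_step_back(const unsigned char *s, size_t pos)
{
    assert(s != NULL && pos > 0);
    do {
        pos--;
    } while (pos > 0 && (s[pos] & 0xC0) == 0x80);
    return pos;
}
```

With this encoding, the example \\x{1234} from the list corresponds to the three bytes E1 88 B4, and \\x{100} to the two bytes C4 80. This is why a quantifier that applies only to the last byte of a literal character cannot repeat the whole character.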
1787    
1788    The following UTF-8 features of Perl 5.6 are not implemented:
1789    
1790    1. The escape sequence \\C to match a single byte.
1791    
1792    2. The use of Unicode tables and properties and escapes \\p, \\P, and \\X.
1793    
1794  .SH AUTHOR  .SH AUTHOR
1795  Philip Hazel <ph10@cam.ac.uk>  Philip Hazel <ph10@cam.ac.uk>
1796  .br  .br
# Line 1696  Cambridge CB2 3QG, England. Line 1802  Cambridge CB2 3QG, England.
1802  .br  .br
1803  Phone: +44 1223 334714  Phone: +44 1223 334714
1804    
1805  Last updated: 27 January 2000  Last updated: 28 August 2000,
1806    .br
1807      the 250th anniversary of the death of J.S. Bach.
1808  .br  .br
1809  Copyright (c) 1997-2000 University of Cambridge.  Copyright (c) 1997-2000 University of Cambridge.
