| 28 |
int pcre_get_substring_list(const char *subject, |
int pcre_get_substring_list(const char *subject, |
| 29 |
int *ovector, int stringcount, const char ***listptr); |
int *ovector, int stringcount, const char ***listptr); |
| 30 |
|
|
| 31 |
|
void pcre_free_substring(const char *stringptr); |
| 32 |
|
|
| 33 |
|
void pcre_free_substring_list(const char **stringptr); |
| 34 |
|
|
| 35 |
const unsigned char *pcre_maketables(void); |
const unsigned char *pcre_maketables(void); |
| 36 |
|
|
| 37 |
int pcre_fullinfo(const pcre *code, const pcre_extra *extra, |
int pcre_fullinfo(const pcre *code, const pcre_extra *extra, |
| 52 |
The PCRE library is a set of functions that implement regu- |
The PCRE library is a set of functions that implement regu- |
| 53 |
lar expression pattern matching using the same syntax and |
lar expression pattern matching using the same syntax and |
| 54 |
semantics as Perl 5, with just a few differences (see |
semantics as Perl 5, with just a few differences (see |
| 55 |
|
|
| 56 |
below). The current implementation corresponds to Perl |
below). The current implementation corresponds to Perl |
| 57 |
5.005, with some additional features from the Perl develop- |
5.005, with some additional features from later versions. |
| 58 |
ment release. |
This includes some experimental, incomplete support for |
| 59 |
|
UTF-8 encoded strings. Details of exactly what is and what |
| 60 |
|
is not supported are given below. |
| 61 |
|
|
| 62 |
PCRE has its own native API, which is described in this |
PCRE has its own native API, which is described in this |
| 63 |
document. There is also a set of wrapper functions that |
document. There is also a set of wrapper functions that |
| 74 |
releases. |
releases. |
| 75 |
|
|
| 76 |
The functions pcre_compile(), pcre_study(), and pcre_exec() |
The functions pcre_compile(), pcre_study(), and pcre_exec() |
| 77 |
are used for compiling and matching regular expressions, |
are used for compiling and matching regular expressions. |
| 78 |
while pcre_copy_substring(), pcre_get_substring(), and |
|
| 79 |
pcre_get_substring_list() are convenience functions for |
The functions pcre_copy_substring(), pcre_get_substring(), |
| 80 |
|
and pcre_get_substring_list() are convenience functions for |
| 81 |
extracting captured substrings from a matched subject |
extracting captured substrings from a matched subject |
| 82 |
string. The function pcre_maketables() is used (optionally) |
string; pcre_free_substring() and pcre_free_substring_list() |
| 83 |
to build a set of character tables in the current locale for |
are also provided, to free the memory used for extracted |
| 84 |
passing to pcre_compile(). |
strings. |
| 85 |
|
|
| 86 |
|
The function pcre_maketables() is used (optionally) to build |
| 87 |
|
a set of character tables in the current locale for passing |
| 88 |
|
to pcre_compile(). |
| 89 |
|
|
| 90 |
The function pcre_fullinfo() is used to find out information |
The function pcre_fullinfo() is used to find out information |
| 91 |
about a compiled pattern; pcre_info() is an obsolete version |
about a compiled pattern; pcre_info() is an obsolete version |
| 104 |
|
|
| 105 |
|
|
| 106 |
MULTI-THREADING |
MULTI-THREADING |
| 107 |
The PCRE functions can be used in multi-threading applica- |
The PCRE functions can be used in multi-threading |
| 108 |
tions, with the proviso that the memory management functions |
|
| 109 |
pointed to by pcre_malloc and pcre_free are shared by all |
|
| 110 |
threads. |
|
| 111 |
|
|
| 112 |
|
|
| 113 |
|
SunOS 5.8 Last change: 2 |
| 114 |
|
|
| 115 |
|
|
| 116 |
|
|
| 117 |
|
applications, with the proviso that the memory management |
| 118 |
|
functions pointed to by pcre_malloc and pcre_free are shared |
| 119 |
|
by all threads. |
| 120 |
|
|
| 121 |
The compiled form of a regular expression is not altered |
The compiled form of a regular expression is not altered |
| 122 |
during matching, so the same compiled pattern can safely be |
during matching, so the same compiled pattern can safely be |
| 124 |
|
|
| 125 |
|
|
| 126 |
|
|
|
|
|
| 127 |
COMPILING A PATTERN |
COMPILING A PATTERN |
| 128 |
The function pcre_compile() is called to compile a pattern |
The function pcre_compile() is called to compile a pattern |
| 129 |
into an internal form. The pattern is a C string terminated |
into an internal form. The pattern is a C string terminated |
| 255 |
followed by "?". It is not compatible with Perl. It can also |
followed by "?". It is not compatible with Perl. It can also |
| 256 |
be set by a (?U) option setting within the pattern. |
be set by a (?U) option setting within the pattern. |
| 257 |
|
|
| 258 |
|
PCRE_UTF8 |
| 259 |
|
|
| 260 |
|
This option causes PCRE to regard both the pattern and the |
| 261 |
|
subject as strings of UTF-8 characters instead of just byte |
| 262 |
|
strings. However, it is available only if PCRE has been |
| 263 |
|
built to include UTF-8 support. If not, the use of this |
| 264 |
|
option provokes an error. Support for UTF-8 is new, experi- |
| 265 |
|
mental, and incomplete. Details of exactly what it entails |
| 266 |
|
are given below. |
| 267 |
|
|
| 268 |
|
|
| 269 |
|
|
| 270 |
STUDYING A PATTERN |
STUDYING A PATTERN |
| 271 |
When a pattern is going to be used several times, it is |
When a pattern is going to be used several times, it is |
| 272 |
worth spending more time analyzing it in order to speed up |
worth spending more time analyzing it in order to speed up |
| 273 |
the time taken for matching. The function pcre_study() takes |
the time taken for matching. The function pcre_study() takes |
| 274 |
|
|
| 275 |
a pointer to a compiled pattern as its first argument, and |
a pointer to a compiled pattern as its first argument, and |
| 276 |
returns a pointer to a pcre_extra block (another void |
returns a pointer to a pcre_extra block (another void |
| 277 |
typedef) containing additional information about the pat- |
typedef) containing additional information about the pat- |
| 375 |
|
|
| 376 |
PCRE_INFO_BACKREFMAX |
PCRE_INFO_BACKREFMAX |
| 377 |
|
|
| 378 |
Return the number of the highest back reference in the pat- |
Return the number of the highest back reference in the |
| 379 |
tern. The fourth argument should point to an int variable. |
pattern. The fourth argument should point to an int vari- |
| 380 |
Zero is returned if there are no back references. |
able. Zero is returned if there are no back references. |
| 381 |
|
|
| 382 |
PCRE_INFO_FIRSTCHAR |
PCRE_INFO_FIRSTCHAR |
| 383 |
|
|
| 636 |
|
|
| 637 |
EXTRACTING CAPTURED SUBSTRINGS |
EXTRACTING CAPTURED SUBSTRINGS |
| 638 |
Captured substrings can be accessed directly by using the |
Captured substrings can be accessed directly by using the |
| 639 |
|
|
| 640 |
|
|
| 641 |
|
|
| 642 |
|
|
| 643 |
|
|
| 644 |
|
|
| 645 |
|
|
| 646 |
|
|
| 647 |
|
|
| 648 |
offsets returned by pcre_exec() in ovector. For convenience, |
offsets returned by pcre_exec() in ovector. For convenience, |
| 649 |
the functions pcre_copy_substring(), pcre_get_substring(), |
the functions pcre_copy_substring(), pcre_get_substring(), |
| 650 |
and pcre_get_substring_list() are provided for extracting |
and pcre_get_substring_list() are provided for extracting |
| 671 |
the entire pattern, while higher values extract the captured |
the entire pattern, while higher values extract the captured |
| 672 |
substrings. For pcre_copy_substring(), the string is placed |
substrings. For pcre_copy_substring(), the string is placed |
| 673 |
in buffer, whose length is given by buffersize, while for |
in buffer, whose length is given by buffersize, while for |
| 674 |
pcre_get_substring() a new block of store is obtained via |
pcre_get_substring() a new block of memory is obtained via |
| 675 |
pcre_malloc, and its address is returned via stringptr. The |
pcre_malloc, and its address is returned via stringptr. The |
| 676 |
yield of the function is the length of the string, not |
yield of the function is the length of the string, not |
| 677 |
including the terminating zero, or one of |
including the terminating zero, or one of |
| 705 |
inspecting the appropriate offset in ovector, which is nega- |
inspecting the appropriate offset in ovector, which is nega- |
| 706 |
tive for unset substrings. |
tive for unset substrings. |
| 707 |
|
|
| 708 |
|
The two convenience functions pcre_free_substring() and |
| 709 |
|
pcre_free_substring_list() can be used to free the memory |
| 710 |
|
returned by a previous call of pcre_get_substring() or |
| 711 |
|
pcre_get_substring_list(), respectively. They do nothing |
| 712 |
|
more than call the function pointed to by pcre_free, which |
| 713 |
|
of course could be called directly from a C program. How- |
| 714 |
|
ever, PCRE is used in some situations where it is linked via |
| 715 |
|
a special interface to another programming language which |
| 716 |
|
cannot use pcre_free directly; it is for these cases that |
| 717 |
|
the functions are provided. |
| 718 |
|
|
| 719 |
|
|
| 720 |
|
|
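As a rough illustration of the direct access described above, the following sketch copies one captured substring out of a subject string using only the offset pairs in ovector. The helper name copy_capture() and its error codes are hypothetical and are not part of PCRE. The sketch assumes the usual layout in which ovector[2*n] and ovector[2*n+1] hold the start and end offsets of substring n, and that a negative offset marks an unset substring.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical error codes for this sketch (not PCRE's own values). */
#define CAPTURE_NOSUBSTRING (-1)  /* no such substring, or it is unset */
#define CAPTURE_NOMEMORY    (-2)  /* buffer too small */

/*
 * Copy captured substring number `stringnumber` from `subject` into
 * `buffer`, adding a terminating zero. Assumes ovector[2*n] and
 * ovector[2*n+1] are the start and end offsets of substring n, and
 * that a negative start offset marks an unset substring.
 * Returns the length of the string, or a negative error code.
 */
int copy_capture(const char *subject, const int *ovector,
                 int stringcount, int stringnumber,
                 char *buffer, int buffersize)
{
    int start, length;

    assert(subject != NULL && ovector != NULL && buffer != NULL);

    if (stringnumber < 0 || stringnumber >= stringcount)
        return CAPTURE_NOSUBSTRING;

    start = ovector[2 * stringnumber];
    if (start < 0)
        return CAPTURE_NOSUBSTRING;

    length = ovector[2 * stringnumber + 1] - start;
    if (length + 1 > buffersize)
        return CAPTURE_NOMEMORY;

    memcpy(buffer, subject + start, (size_t)length);
    buffer[length] = '\0';
    return length;
}
```

For example, with the subject "key=value" and a hand-written ovector of {0, 9, 0, 3, 4, 9}, asking for substring 2 with a stringcount of 3 places "value" in the buffer and returns 5. In a real program the ovector contents would come from pcre_exec() rather than being written by hand.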
| 783 |
(?p{code}) constructions. However, there is some experimen- |
(?p{code}) constructions. However, there is some experimen- |
| 784 |
tal support for recursive patterns using the non-Perl item |
tal support for recursive patterns using the non-Perl item |
| 785 |
(?R). |
(?R). |
| 786 |
|
|
| 787 |
8. There are at the time of writing some oddities in Perl |
8. There are at the time of writing some oddities in Perl |
| 788 |
5.005_02 concerned with the settings of captured strings |
5.005_02 concerned with the settings of captured strings |
| 789 |
when part of a pattern is repeated. For example, matching |
when part of a pattern is repeated. For example, matching |
| 836 |
The syntax and semantics of the regular expressions sup- |
The syntax and semantics of the regular expressions sup- |
| 837 |
ported by PCRE are described below. Regular expressions are |
ported by PCRE are described below. Regular expressions are |
| 838 |
also described in the Perl documentation and in a number of |
also described in the Perl documentation and in a number of |
|
|
|
| 839 |
other books, some of which have copious examples. Jeffrey |
other books, some of which have copious examples. Jeffrey |
| 840 |
Friedl's "Mastering Regular Expressions", published by |
Friedl's "Mastering Regular Expressions", published by |
| 841 |
O'Reilly (ISBN 1-56592-257), covers them in great detail. |
O'Reilly (ISBN 1-56592-257), covers them in great detail. |
| 842 |
|
|
| 843 |
The description here is intended as reference documentation. |
The description here is intended as reference documentation. |
| 844 |
|
The basic operation of PCRE is on strings of bytes. However, |
| 845 |
|
there are the beginnings of some support for UTF-8 character |
| 846 |
|
strings. To use this support you must configure PCRE to |
| 847 |
|
include it, and then call pcre_compile() with the PCRE_UTF8 |
| 848 |
|
option. How this affects the pattern matching is described |
| 849 |
|
in the final section of this document. |
| 850 |
|
|
| 851 |
A regular expression is a pattern that is matched against a |
A regular expression is a pattern that is matched against a |
| 852 |
subject string from left to right. Most characters stand for |
subject string from left to right. Most characters stand for |
| 1061 |
Outside a character class, in the default matching mode, the |
Outside a character class, in the default matching mode, the |
| 1062 |
circumflex character is an assertion which is true only if |
circumflex character is an assertion which is true only if |
| 1063 |
the current matching point is at the start of the subject |
the current matching point is at the start of the subject |
| 1064 |
|
|
| 1065 |
string. If the startoffset argument of pcre_exec() is non- |
string. If the startoffset argument of pcre_exec() is non- |
| 1066 |
zero, circumflex can never match. Inside a character class, |
zero, circumflex can never match. Inside a character class, |
| 1067 |
circumflex has an entirely different meaning (see below). |
circumflex has an entirely different meaning (see below). |
| 1114 |
Outside a character class, a dot in the pattern matches any |
Outside a character class, a dot in the pattern matches any |
| 1115 |
one character in the subject, including a non-printing char- |
one character in the subject, including a non-printing char- |
| 1116 |
acter, but not (by default) newline. If the PCRE_DOTALL |
acter, but not (by default) newline. If the PCRE_DOTALL |
| 1117 |
|
|
| 1118 |
option is set, dots match newlines as well. The handling of |
option is set, dots match newlines as well. The handling of |
| 1119 |
dot is entirely independent of the handling of circumflex |
dot is entirely independent of the handling of circumflex |
| 1120 |
and dollar, the only relationship being that they both |
and dollar, the only relationship being that they both |
| 1576 |
A back reference that occurs inside the parentheses to which |
A back reference that occurs inside the parentheses to which |
| 1577 |
it refers fails when the subpattern is first used, so, for |
it refers fails when the subpattern is first used, so, for |
| 1578 |
example, (a\1) never matches. However, such references can |
example, (a\1) never matches. However, such references can |
| 1579 |
be useful inside repeated subpatterns. For example, the |
be useful inside repeated subpatterns. For example, the pat- |
| 1580 |
pattern |
tern |
| 1581 |
|
|
| 1582 |
(a|b\1)+ |
(a|b\1)+ |
| 1583 |
|
|
| 1584 |
matches any number of "a"s and also "aba", "ababaa" etc. At |
matches any number of "a"s and also "aba", "ababbaa" etc. At |
| 1585 |
each iteration of the subpattern, the back reference matches |
each iteration of the subpattern, the back reference matches |
| 1586 |
the character string corresponding to the previous itera- |
the character string corresponding to the previous |
| 1587 |
tion. In order for this to work, the pattern must be such |
iteration. In order for this to work, the pattern must be |
| 1588 |
that the first iteration does not need to match the back |
such that the first iteration does not need to match the |
| 1589 |
reference. This can be done using alternation, as in the |
back reference. This can be done using alternation, as in |
| 1590 |
example above, or by a quantifier with a minimum of zero. |
the example above, or by a quantifier with a minimum of |
| 1591 |
|
zero. |
| 1592 |
|
|
| 1593 |
|
|
| 1594 |
|
|
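The behaviour of (a|b\1)+ can be traced by hand with a small sketch. The code below is a hand-written matcher for this one pattern only, not PCRE's algorithm. The function name matches_a_or_b_backref() is hypothetical, and it tests whether the whole string is consumed, as if the pattern were anchored at both ends. Each iteration remembers where the previous iteration's capture lies, which is what \1 refers to.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Try to consume s[pos..] with one or more further iterations of
 * the group (a|b\1). prev/prev_len describe the text captured by the
 * previous iteration; prev is NULL before the first iteration, when
 * the back reference is still unset and the second alternative fails.
 */
static int iterate(const char *s, size_t pos,
                   const char *prev, size_t prev_len)
{
    size_t remaining = strlen(s + pos);

    /* First alternative: a literal "a". */
    if (s[pos] == 'a') {
        if (remaining == 1 || iterate(s, pos + 1, s + pos, 1))
            return 1;
    }

    /* Second alternative: "b" followed by the previous capture. */
    if (s[pos] == 'b' && prev != NULL &&
        remaining >= 1 + prev_len &&
        memcmp(s + pos + 1, prev, prev_len) == 0) {
        size_t len = 1 + prev_len;
        if (remaining == len || iterate(s, pos + len, s + pos, len))
            return 1;
    }

    return 0;
}

/* Returns 1 if the whole of s matches the pattern, else 0. */
int matches_a_or_b_backref(const char *s)
{
    assert(s != NULL);
    return iterate(s, 0, NULL, 0);
}
```

Under these assumptions the function accepts "aaa", "aba" and "ababbaa", and rejects "ababaa": after "a" and "ba", the third iteration would need "b" followed by the previous capture "ba", which is not what follows. It also rejects "ba", because the first iteration cannot use the back reference.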
| 1741 |
|
|
| 1742 |
This kind of parenthesis "locks up" the part of the pattern |
This kind of parenthesis "locks up" the part of the pattern |
| 1743 |
it contains once it has matched, and a failure further into |
it contains once it has matched, and a failure further into |
| 1744 |
the pattern is prevented from backtracking into it. Back- |
the pattern is prevented from backtracking into it. |
| 1745 |
tracking past it to previous items, however, works as nor- |
Backtracking past it to previous items, however, works as |
| 1746 |
mal. |
normal. |
| 1747 |
|
|
| 1748 |
An alternative description is that a subpattern of this type |
An alternative description is that a subpattern of this type |
| 1749 |
matches the string of characters that an identical stan- |
matches the string of characters that an identical stan- |
| 2001 |
repeat can match 0, 1, 2, 3, or 4 times, and for each of |
repeat can match 0, 1, 2, 3, or 4 times, and for each of |
| 2002 |
those cases other than 0, the + repeats can match different |
those cases other than 0, the + repeats can match different |
| 2003 |
numbers of times.) When the remainder of the pattern is such |
numbers of times.) When the remainder of the pattern is such |
| 2004 |
that the entire match is going to fail, PCRE has in princi- |
that the entire match is going to fail, PCRE has in |
| 2005 |
ple to try every possible variation, and this can take an |
principle to try every possible variation, and this can take |
| 2006 |
extremely long time. |
an extremely long time. |
| 2007 |
|
|
| 2008 |
An optimization catches some of the simpler cases such |
An optimization catches some of the simpler cases such |
| 2009 |
as |
as |
| 2026 |
|
|
| 2027 |
|
|
| 2028 |
|
|
| 2029 |
|
UTF-8 SUPPORT |
| 2030 |
|
Starting at release 3.3, PCRE has some support for character |
| 2031 |
|
strings encoded in the UTF-8 format. This is incomplete, and |
| 2032 |
|
is regarded as experimental. In order to use it, you must |
| 2033 |
|
configure PCRE to include UTF-8 support in the code, and, in |
| 2034 |
|
addition, you must call pcre_compile() with the PCRE_UTF8 |
| 2035 |
|
option flag. When you do this, both the pattern and any sub- |
| 2036 |
|
ject strings that are matched against it are treated as |
| 2037 |
|
UTF-8 strings instead of just strings of bytes, but only in |
| 2038 |
|
the cases that are mentioned below. |
| 2039 |
|
|
| 2040 |
|
If you compile PCRE with UTF-8 support, but do not use it at |
| 2041 |
|
run time, the library will be a bit bigger, but the addi- |
| 2042 |
|
tional run time overhead is limited to testing the PCRE_UTF8 |
| 2043 |
|
flag in several places, so should not be very large. |
| 2044 |
|
|
| 2045 |
|
PCRE assumes that the strings it is given contain valid |
| 2046 |
|
UTF-8 codes. It does not diagnose invalid UTF-8 strings. If |
| 2047 |
|
you pass invalid UTF-8 strings to PCRE, the results are |
| 2048 |
|
undefined. |
| 2049 |
|
|
| 2050 |
|
Running with PCRE_UTF8 set causes these changes in the way |
| 2051 |
|
PCRE works: |
| 2052 |
|
|
| 2053 |
|
1. In a pattern, the escape sequence \x{...}, where the con- |
| 2054 |
|
tents of the braces is a string of hexadecimal digits, is |
| 2055 |
|
interpreted as a UTF-8 character whose code number is the |
| 2056 |
|
given hexadecimal number, for example: \x{1234}. This |
| 2057 |
|
inserts from one to six literal bytes into the pattern, |
| 2058 |
|
using the UTF-8 encoding. If a non-hexadecimal digit appears |
| 2059 |
|
between the braces, the item is not recognized. |
| 2060 |
|
|
| 2061 |
|
2. The original hexadecimal escape sequence, \xhh, generates |
| 2062 |
|
a two-byte UTF-8 character if its value is greater than 127. |
| 2063 |
|
|
| 2064 |
|
3. Repeat quantifiers are NOT correctly handled if they fol- |
| 2065 |
|
low a multibyte character. For example, \x{100}* and \xc3+ |
| 2066 |
|
do not work. If you want to repeat such characters, you must |
| 2067 |
|
enclose them in non-capturing parentheses, for example |
| 2068 |
|
(?:\x{100}), at present. |
| 2069 |
|
|
| 2070 |
|
4. The dot metacharacter matches one UTF-8 character instead |
| 2071 |
|
of a single byte. |
| 2072 |
|
|
| 2073 |
|
5. Unlike literal UTF-8 characters, the dot metacharacter |
| 2074 |
|
followed by a repeat quantifier does operate correctly on |
| 2075 |
|
UTF-8 characters instead of single bytes. |
| 2076 |
|
|
| 2077 |
|
6. Although the \x{...} escape is permitted in a character |
| 2078 |
|
class, characters whose values are greater than 255 cannot |
| 2079 |
|
be included in a class. |
| 2080 |
|
|
| 2081 |
|
7. A class is matched against a UTF-8 character instead of |
| 2082 |
|
just a single byte, but it can match only characters whose |
| 2083 |
|
values are less than 256. Characters with greater values |
| 2084 |
|
always fail to match a class. |
| 2085 |
|
|
| 2086 |
|
8. Repeated classes work correctly on multiple characters. |
| 2087 |
|
|
| 2088 |
|
9. Classes containing just a single character whose value is |
| 2089 |
|
greater than 127 (but less than 256), for example, [\x80] or |
| 2090 |
|
[^\x{93}], do not work because these are optimized into sin- |
| 2091 |
|
gle byte matches. In the first case, of course, the class |
| 2092 |
|
brackets are just redundant. |
| 2093 |
|
|
| 2094 |
|
10. Lookbehind assertions move backwards in the subject by a |
| 2095 |
|
fixed number of characters instead of a fixed number of |
| 2096 |
|
bytes. Simple cases have been tested to work correctly, but |
| 2097 |
|
there may be hidden gotchas herein. |
| 2098 |
|
|
| 2099 |
|
11. The character types such as \d and \w do not work |
| 2100 |
|
correctly with UTF-8 characters. They continue to test a |
| 2101 |
|
single byte. |
| 2102 |
|
|
| 2103 |
|
12. Anything not explicitly mentioned here continues to work |
| 2104 |
|
in bytes rather than in characters. |
| 2105 |
|
|
| 2106 |
|
The following UTF-8 features of Perl 5.6 are not imple- |
| 2107 |
|
mented: |
| 2108 |
|
|
| 2109 |
|
1. The escape sequence \C to match a single byte. |
| 2110 |
|
|
| 2111 |
|
2. The use of Unicode tables and properties and escapes \p, |
| 2112 |
|
\P, and \X. |
| 2113 |
|
|
| 2114 |
|
|
| 2115 |
|
|
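To make the byte counts in points 1 and 2 of the list of changes above concrete, here is a minimal sketch of the encoding step. It assumes the original UTF-8 scheme, in which values up to 0x7fffffff occupy from one to six bytes. The function name utf8_encode() is hypothetical, and this is not PCRE's internal code.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Encode the character value c as UTF-8 into out, which must have
 * room for at least six bytes. Returns the number of bytes written
 * (1 to 6), or 0 if c is above 0x7fffffff.
 */
int utf8_encode(unsigned long c, unsigned char *out)
{
    /* Largest value that fits in 1, 2, ... 6 bytes. */
    static const unsigned long limit[6] = {
        0x7fUL, 0x7ffUL, 0xffffUL, 0x1fffffUL, 0x3ffffffUL, 0x7fffffffUL
    };
    /* Marker bits of the first byte for each length. */
    static const unsigned char lead[6] = {
        0x00, 0xc0, 0xe0, 0xf0, 0xf8, 0xfc
    };
    int n, i;

    assert(out != NULL);

    /* Find the shortest length that can hold c. */
    for (n = 0; n < 6; n++)
        if (c <= limit[n])
            break;
    if (n == 6)
        return 0;

    /* Fill the continuation bytes from the end, six bits at a time. */
    for (i = n; i > 0; i--) {
        out[i] = (unsigned char)(0x80 | (c & 0x3f));
        c >>= 6;
    }
    out[0] = (unsigned char)(lead[n] | c);
    return n + 1;
}
```

For instance, the value in \x{1234} becomes the three bytes 0xe1 0x88 0xb4, and the value in \xc3 becomes the two bytes 0xc3 0x83, while a value below 128 such as 0x41 stays a single byte.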
| 2116 |
AUTHOR |
AUTHOR |
| 2117 |
Philip Hazel <ph10@cam.ac.uk> |
Philip Hazel <ph10@cam.ac.uk> |
| 2118 |
University Computing Service, |
University Computing Service, |
| 2120 |
Cambridge CB2 3QG, England. |
Cambridge CB2 3QG, England. |
| 2121 |
Phone: +44 1223 334714 |
Phone: +44 1223 334714 |
| 2122 |
|
|
| 2123 |
Last updated: 27 January 2000 |
Last updated: 28 August 2000, |
| 2124 |
|
the 250th anniversary of the death of J.S. Bach. |
| 2125 |
Copyright (c) 1997-2000 University of Cambridge. |
Copyright (c) 1997-2000 University of Cambridge. |