| 400 |
Either of the functions <b>pcre_compile()</b> or <b>pcre_compile2()</b> can be |
Either of the functions <b>pcre_compile()</b> or <b>pcre_compile2()</b> can be |
| 401 |
called to compile a pattern into an internal form. The only difference between |
called to compile a pattern into an internal form. The only difference between |
| 402 |
the two interfaces is that <b>pcre_compile2()</b> has an additional argument, |
the two interfaces is that <b>pcre_compile2()</b> has an additional argument, |
| 403 |
<i>errorcodeptr</i>, via which a numerical error code can be returned. |
<i>errorcodeptr</i>, via which a numerical error code can be returned. To avoid |
| 404 |
|
too much repetition, we refer just to <b>pcre_compile()</b> below, but the |
| 405 |
|
information applies equally to <b>pcre_compile2()</b>. |
| 406 |
</P> |
</P> |
| 407 |
<P> |
<P> |
| 408 |
The pattern is a C string terminated by a binary zero, and is passed in the |
The pattern is a C string terminated by a binary zero, and is passed in the |
| 422 |
The <i>options</i> argument contains various bit settings that affect the |
The <i>options</i> argument contains various bit settings that affect the |
| 423 |
compilation. It should be zero if no options are required. The available |
compilation. It should be zero if no options are required. The available |
| 424 |
options are described below. Some of them (in particular, those that are |
options are described below. Some of them (in particular, those that are |
| 425 |
compatible with Perl, but also some others) can also be set and unset from |
compatible with Perl, but some others as well) can also be set and unset from |
| 426 |
within the pattern (see the detailed description in the |
within the pattern (see the detailed description in the |
| 427 |
<a href="pcrepattern.html"><b>pcrepattern</b></a> |
<a href="pcrepattern.html"><b>pcrepattern</b></a> |
| 428 |
documentation). For those options that can be different in different parts of |
documentation). For those options that can be different in different parts of |
| 429 |
the pattern, the contents of the <i>options</i> argument specifies their initial |
the pattern, the contents of the <i>options</i> argument specifies their |
| 430 |
settings at the start of compilation and execution. The PCRE_ANCHORED and |
settings at the start of compilation and execution. The PCRE_ANCHORED, |
| 431 |
PCRE_NEWLINE_<i>xxx</i> options can be set at the time of matching as well as at |
PCRE_BSR_<i>xxx</i>, and PCRE_NEWLINE_<i>xxx</i> options can be set at the time |
| 432 |
compile time. |
of matching as well as at compile time. |
| 433 |
</P> |
</P> |
| 434 |
<P> |
<P> |
| 435 |
If <i>errptr</i> is NULL, <b>pcre_compile()</b> returns NULL immediately. |
If <i>errptr</i> is NULL, <b>pcre_compile()</b> returns NULL immediately. |
| 437 |
NULL, and sets the variable pointed to by <i>errptr</i> to point to a textual |
NULL, and sets the variable pointed to by <i>errptr</i> to point to a textual |
| 438 |
error message. This is a static string that is part of the library. You must |
error message. This is a static string that is part of the library. You must |
| 439 |
not try to free it. The byte offset from the start of the pattern to the |
not try to free it. The byte offset from the start of the pattern to the |
| 440 |
character that was being processes when the error was discovered is placed in |
character that was being processed when the error was discovered is placed in |
| 441 |
the variable pointed to by <i>erroffset</i>, which must not be NULL. If it is, |
the variable pointed to by <i>erroffset</i>, which must not be NULL. If it is, |
| 442 |
an immediate error is given. Some errors are not detected until checks are |
an immediate error is given. Some errors are not detected until checks are |
| 443 |
carried out when the whole pattern has been scanned; in this case the offset is |
carried out when the whole pattern has been scanned; in this case the offset is |
| 774 |
</P> |
</P> |
| 775 |
<P> |
<P> |
| 776 |
The returned value from <b>pcre_study()</b> can be passed directly to |
The returned value from <b>pcre_study()</b> can be passed directly to |
| 777 |
<b>pcre_exec()</b>. However, a <b>pcre_extra</b> block also contains other |
<b>pcre_exec()</b> or <b>pcre_dfa_exec()</b>. However, a <b>pcre_extra</b> block |
| 778 |
fields that can be set by the caller before the block is passed; these are |
also contains other fields that can be set by the caller before the block is |
| 779 |
described |
passed; these are described |
| 780 |
<a href="#extradata">below</a> |
<a href="#extradata">below</a> |
| 781 |
in the section on matching a pattern. |
in the section on matching a pattern. |
| 782 |
</P> |
</P> |
| 783 |
<P> |
<P> |
| 784 |
If studying the pattern does not produce any additional information |
If studying the pattern does not produce any useful information, |
| 785 |
<b>pcre_study()</b> returns NULL. In that circumstance, if the calling program |
<b>pcre_study()</b> returns NULL. In that circumstance, if the calling program |
| 786 |
wants to pass any of the other fields to <b>pcre_exec()</b>, it must set up its |
wants to pass any of the other fields to <b>pcre_exec()</b> or |
| 787 |
own <b>pcre_extra</b> block. |
<b>pcre_dfa_exec()</b>, it must set up its own <b>pcre_extra</b> block. |
| 788 |
</P> |
</P> |
| 789 |
<P> |
<P> |
| 790 |
The second argument of <b>pcre_study()</b> contains option bits. At present, no |
The second argument of <b>pcre_study()</b> contains option bits. At present, no |
| 807 |
0, /* no options exist */ |
0, /* no options exist */ |
| 808 |
&error); /* set to NULL or points to a message */ |
&error); /* set to NULL or points to a message */ |
| 809 |
</pre> |
</pre> |
| 810 |
At present, studying a pattern is useful only for non-anchored patterns that do |
Studying a pattern does two things: first, a lower bound for the length of |
| 811 |
not have a single fixed starting character. A bitmap of possible starting |
subject string that is needed to match the pattern is computed. This does not |
| 812 |
bytes is created. |
mean that there are any strings of that length that match, but it does |
| 813 |
|
guarantee that no shorter strings match. The value is used by |
| 814 |
|
<b>pcre_exec()</b> and <b>pcre_dfa_exec()</b> to avoid wasting time by trying to |
| 815 |
|
match strings that are shorter than the lower bound. You can find out the value |
| 816 |
|
in a calling program via the <b>pcre_fullinfo()</b> function. |
| 817 |
|
</P> |
| 818 |
|
<P> |
| 819 |
|
Studying a pattern is also useful for non-anchored patterns that do not have a |
| 820 |
|
single fixed starting character. A bitmap of possible starting bytes is |
| 821 |
|
created. This speeds up finding a position in the subject at which to start |
| 822 |
|
matching. |
| 823 |
<a name="localesupport"></a></P> |
<a name="localesupport"></a></P> |
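The starting-byte bitmap described above can be pictured with a small sketch. This is an illustration only, not PCRE's internal code: the type start_bitmap and the functions bitmap_from_bytes(), bitmap_has(), and first_candidate() are hypothetical names, and the layout (one bit per byte value, 32 bytes in total) is an assumption. It shows how such a bitmap lets a matcher skip subject positions that cannot begin a match.

```c
#include <string.h>

/* Hypothetical 256-bit set: one bit for each possible byte value. */
typedef struct {
    unsigned char bits[32];
} start_bitmap;

/* Build a bitmap from a zero-terminated list of possible starting bytes. */
start_bitmap bitmap_from_bytes(const char *bytes)
{
    start_bitmap bm;
    memset(&bm, 0, sizeof bm);
    for (; *bytes != 0; bytes++) {
        unsigned char c = (unsigned char)*bytes;
        bm.bits[c / 8] |= (unsigned char)(1u << (c % 8));
    }
    return bm;
}

/* Return 1 if byte c is marked as a possible starting byte, else 0. */
int bitmap_has(const start_bitmap *bm, unsigned char c)
{
    return (bm->bits[c / 8] >> (c % 8)) & 1;
}

/* Return the first offset at or after start whose byte is in the
   bitmap, or -1 if no such position exists in the subject. */
int first_candidate(const start_bitmap *bm, const char *subject,
                    int length, int start)
{
    int i;
    for (i = start; i < length; i++) {
        if (bitmap_has(bm, (unsigned char)subject[i]))
            return i;
    }
    return -1;
}
```

For a pattern such as /cat|dog/, the possible starting bytes are "c" and "d", so with bitmap_from_bytes("cd") the scan over the subject "xxxdog" skips the three leading "x" bytes without attempting a match at any of them.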
| 824 |
<br><a name="SEC10" href="#TOC1">LOCALE SUPPORT</a><br> |
<br><a name="SEC10" href="#TOC1">LOCALE SUPPORT</a><br> |
| 825 |
<P> |
<P> |
| 990 |
/^a\d+z\d+/ the returned value is "z", but for /^a\dz\d/ the returned value |
/^a\d+z\d+/ the returned value is "z", but for /^a\dz\d/ the returned value |
| 991 |
is -1. |
is -1. |
| 992 |
<pre> |
<pre> |
| 993 |
|
PCRE_INFO_MINLENGTH |
| 994 |
|
</pre> |
| 995 |
|
If the pattern was studied and a minimum length for matching subject strings |
| 996 |
|
was computed, its value is returned. Otherwise the returned value is -1. The |
| 997 |
|
value is a number of characters, not bytes (this may be relevant in UTF-8 |
| 998 |
|
mode). The fourth argument should point to an <b>int</b> variable. A |
| 999 |
|
non-negative value is a lower bound to the length of any matching string. There |
| 1000 |
|
may not be any strings of that length that do actually match, but every string |
| 1001 |
|
that does match is at least that long. |
| 1002 |
|
<pre> |
| 1003 |
PCRE_INFO_NAMECOUNT |
PCRE_INFO_NAMECOUNT |
| 1004 |
PCRE_INFO_NAMEENTRYSIZE |
PCRE_INFO_NAMEENTRYSIZE |
| 1005 |
PCRE_INFO_NAMETABLE |
PCRE_INFO_NAMETABLE |
| 1021 |
length of the longest name. PCRE_INFO_NAMETABLE returns a pointer to the first |
length of the longest name. PCRE_INFO_NAMETABLE returns a pointer to the first |
| 1022 |
entry of the table (a pointer to <b>char</b>). The first two bytes of each entry |
entry of the table (a pointer to <b>char</b>). The first two bytes of each entry |
| 1023 |
are the number of the capturing parenthesis, most significant byte first. The |
are the number of the capturing parenthesis, most significant byte first. The |
| 1024 |
rest of the entry is the corresponding name, zero terminated. The names are in |
rest of the entry is the corresponding name, zero terminated. |
| 1025 |
alphabetical order. When PCRE_DUPNAMES is set, duplicate names are in order of |
</P> |
| 1026 |
their parentheses numbers. For example, consider the following pattern (assume |
<P> |
| 1027 |
PCRE_EXTENDED is set, so white space - including newlines - is ignored): |
The names are in alphabetical order. Duplicate names may appear if (?| is used |
| 1028 |
|
to create multiple groups with the same number, as described in the |
| 1029 |
|
<a href="pcrepattern.html#dupsubpatternnumber">section on duplicate subpattern numbers</a> |
| 1030 |
|
in the |
| 1031 |
|
<a href="pcrepattern.html"><b>pcrepattern</b></a> |
| 1032 |
|
page. Duplicate names for subpatterns with different numbers are permitted only |
| 1033 |
|
if PCRE_DUPNAMES is set. In all cases of duplicate names, they appear in the |
| 1034 |
|
table in the order in which they were found in the pattern. In the absence of |
| 1035 |
|
(?| this is the order of increasing number; when (?| is used this is not |
| 1036 |
|
necessarily the case because later subpatterns may have lower numbers. |
| 1037 |
|
</P> |
| 1038 |
|
<P> |
| 1039 |
|
As a simple example of the name/number table, consider the following pattern |
| 1040 |
|
(assume PCRE_EXTENDED is set, so white space - including newlines - is |
| 1041 |
|
ignored): |
| 1042 |
<pre> |
<pre> |
| 1043 |
(?<date> (?<year>(\d\d)?\d\d) - (?<month>\d\d) - (?<day>\d\d) ) |
(?<date> (?<year>(\d\d)?\d\d) - (?<month>\d\d) - (?<day>\d\d) ) |
| 1044 |
</pre> |
</pre> |
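The entry layout described above (two bytes holding the parenthesis number, most significant byte first, followed by the zero-terminated name) can be decoded with a few lines of C. This is a minimal sketch: the type name_entry and the function read_name_entry() are hypothetical helpers, not part of the PCRE API. In a real program the table pointer and the entry size would be obtained from pcre_fullinfo() with PCRE_INFO_NAMETABLE and PCRE_INFO_NAMEENTRYSIZE.

```c
#include <stddef.h>

/* Hypothetical decoded form of one name table entry. */
typedef struct {
    int number;        /* number of the capturing parenthesis */
    const char *name;  /* points into the table; zero terminated */
} name_entry;

/* Decode the entry at the given index. Each entry occupies
   entry_size bytes: two bytes of group number (most significant
   byte first), then the name and its terminating zero. */
name_entry read_name_entry(const char *table, int entry_size, int index)
{
    const unsigned char *p =
        (const unsigned char *)table + (size_t)index * (size_t)entry_size;
    name_entry e;
    e.number = (p[0] << 8) | p[1];
    e.name = (const char *)(p + 2);
    return e;
}
```

Looping index from 0 up to the value returned for PCRE_INFO_NAMECOUNT visits every entry. The returned name pointer refers to memory inside the compiled pattern's table, so it must not be freed or modified.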
| 1098 |
Return the size of the data block pointed to by the <i>study_data</i> field in |
Return the size of the data block pointed to by the <i>study_data</i> field in |
| 1099 |
a <b>pcre_extra</b> block. That is, it is the value that was passed to |
a <b>pcre_extra</b> block. That is, it is the value that was passed to |
| 1100 |
<b>pcre_malloc()</b> when PCRE was getting memory into which to place the data |
<b>pcre_malloc()</b> when PCRE was getting memory into which to place the data |
| 1101 |
created by <b>pcre_study()</b>. The fourth argument should point to a |
created by <b>pcre_study()</b>. If <b>pcre_extra</b> is NULL, or there is no |
| 1102 |
|
study data, zero is returned. The fourth argument should point to a |
| 1103 |
<b>size_t</b> variable. |
<b>size_t</b> variable. |
| 1104 |
</P> |
</P> |
| 1105 |
<br><a name="SEC12" href="#TOC1">OBSOLETE INFO FUNCTION</a><br> |
<br><a name="SEC12" href="#TOC1">OBSOLETE INFO FUNCTION</a><br> |
| 1159 |
<P> |
<P> |
| 1160 |
The function <b>pcre_exec()</b> is called to match a subject string against a |
The function <b>pcre_exec()</b> is called to match a subject string against a |
| 1161 |
compiled pattern, which is passed in the <i>code</i> argument. If the |
compiled pattern, which is passed in the <i>code</i> argument. If the |
| 1162 |
pattern has been studied, the result of the study should be passed in the |
pattern was studied, the result of the study should be passed in the |
| 1163 |
<i>extra</i> argument. This function is the main matching facility of the |
<i>extra</i> argument. This function is the main matching facility of the |
| 1164 |
library, and it operates in a Perl-like manner. For specialist use there is |
library, and it operates in a Perl-like manner. For specialist use there is |
| 1165 |
also an alternative matching function, which is described |
also an alternative matching function, which is described |
| 1226 |
The <i>match_limit</i> field provides a means of preventing PCRE from using up a |
The <i>match_limit</i> field provides a means of preventing PCRE from using up a |
| 1227 |
vast amount of resources when running patterns that are not going to match, |
vast amount of resources when running patterns that are not going to match, |
| 1228 |
but which have a very large number of possibilities in their search trees. The |
but which have a very large number of possibilities in their search trees. The |
| 1229 |
classic example is the use of nested unlimited repeats. |
classic example is a pattern that uses nested unlimited repeats. |
| 1230 |
</P> |
</P> |
| 1231 |
<P> |
<P> |
| 1232 |
Internally, PCRE uses a function called <b>match()</b> which it calls repeatedly |
Internally, PCRE uses a function called <b>match()</b> which it calls repeatedly |
| 1376 |
<pre> |
<pre> |
| 1377 |
PCRE_NOTEMPTY_ATSTART |
PCRE_NOTEMPTY_ATSTART |
| 1378 |
</pre> |
</pre> |
| 1379 |
This is like PCRE_NOTEMPTY, except that an empty string match that is not at |
This is like PCRE_NOTEMPTY, except that an empty string match that is not at |
| 1380 |
the start of the subject is permitted. If the pattern is anchored, such a match |
the start of the subject is permitted. If the pattern is anchored, such a match |
| 1381 |
can occur only if the pattern contains \K. |
can occur only if the pattern contains \K. |
| 1382 |
</P> |
</P> |
| 1427 |
subject, or a value of <i>startoffset</i> that does not point to the start of a |
subject, or a value of <i>startoffset</i> that does not point to the start of a |
| 1428 |
UTF-8 character, is undefined. Your program may crash. |
UTF-8 character, is undefined. Your program may crash. |
| 1429 |
<pre> |
<pre> |
| 1430 |
PCRE_PARTIAL_HARD |
PCRE_PARTIAL_HARD |
| 1431 |
PCRE_PARTIAL_SOFT |
PCRE_PARTIAL_SOFT |
| 1432 |
</pre> |
</pre> |
| 1433 |
These options turn on the partial matching feature. For backwards |
These options turn on the partial matching feature. For backwards |
| 1536 |
advisable to supply an <i>ovector</i>. |
advisable to supply an <i>ovector</i>. |
| 1537 |
</P> |
</P> |
| 1538 |
<P> |
<P> |
| 1539 |
The <b>pcre_info()</b> function can be used to find out how many capturing |
The <b>pcre_fullinfo()</b> function can be used to find out how many capturing |
| 1540 |
subpatterns there are in a compiled pattern. The smallest size for |
subpatterns there are in a compiled pattern. The smallest size for |
| 1541 |
<i>ovector</i> that will allow for <i>n</i> captured substrings, in addition to |
<i>ovector</i> that will allow for <i>n</i> captured substrings, in addition to |
| 1542 |
the offsets of the substring matched by the whole pattern, is (<i>n</i>+1)*3. |
the offsets of the substring matched by the whole pattern, is (<i>n</i>+1)*3. |
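The sizing rule above is simple arithmetic, sketched here as a helper. The function name min_ovector_size() is hypothetical. The count <i>n</i> is assumed to come from pcre_fullinfo() called with PCRE_INFO_CAPTURECOUNT.

```c
/* Smallest ovector size (in ints) that can hold the offsets of the
   whole match plus capture_count captured substrings: (n+1)*3. */
int min_ovector_size(int capture_count)
{
    return (capture_count + 1) * 3;
}
```

For a pattern with four capturing subpatterns this gives 15, so an array declared as int ovector[15] is the smallest that returns every captured substring.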
| 1642 |
</pre> |
</pre> |
| 1643 |
This code is no longer in use. It was formerly returned when the PCRE_PARTIAL |
This code is no longer in use. It was formerly returned when the PCRE_PARTIAL |
| 1644 |
option was used with a compiled pattern containing items that were not |
option was used with a compiled pattern containing items that were not |
| 1645 |
supported for partial matching. From release 8.00 onwards, there are no |
supported for partial matching. From release 8.00 onwards, there are no |
| 1646 |
restrictions on partial matching. |
restrictions on partial matching. |
| 1647 |
<pre> |
<pre> |
| 1648 |
PCRE_ERROR_INTERNAL (-14) |
PCRE_ERROR_INTERNAL (-14) |
| 1816 |
the behaviour may not be what you want (see the next section). |
the behaviour may not be what you want (see the next section). |
| 1817 |
</P> |
</P> |
| 1818 |
<P> |
<P> |
| 1819 |
<b>Warning:</b> If the pattern uses the "(?|" feature to set up multiple |
<b>Warning:</b> If the pattern uses the (?| feature to set up multiple |
| 1820 |
subpatterns with the same number, you cannot use names to distinguish them, |
subpatterns with the same number, as described in the |
| 1821 |
because names are not included in the compiled code. The matching process uses |
<a href="pcrepattern.html#dupsubpatternnumber">section on duplicate subpattern numbers</a> |
| 1822 |
only numbers. |
in the |
| 1823 |
|
<a href="pcrepattern.html"><b>pcrepattern</b></a> |
| 1824 |
|
page, you cannot use names to distinguish the different subpatterns, because |
| 1825 |
|
names are not included in the compiled code. The matching process uses only |
| 1826 |
|
numbers. For this reason, the use of different names for subpatterns of the |
| 1827 |
|
same number causes an error at compile time. |
| 1828 |
</P> |
</P> |
| 1829 |
<br><a name="SEC17" href="#TOC1">DUPLICATE SUBPATTERN NAMES</a><br> |
<br><a name="SEC17" href="#TOC1">DUPLICATE SUBPATTERN NAMES</a><br> |
| 1830 |
<P> |
<P> |
| 1833 |
</P> |
</P> |
| 1834 |
<P> |
<P> |
| 1835 |
When a pattern is compiled with the PCRE_DUPNAMES option, names for subpatterns |
When a pattern is compiled with the PCRE_DUPNAMES option, names for subpatterns |
| 1836 |
are not required to be unique. Normally, patterns with duplicate names are such |
are not required to be unique. (Duplicate names are always allowed for |
| 1837 |
that in any one match, only one of the named subpatterns participates. An |
subpatterns with the same number, created by using the (?| feature. Indeed, if |
| 1838 |
example is shown in the |
such subpatterns are named, they are required to use the same names.) |
| 1839 |
|
</P> |
| 1840 |
|
<P> |
| 1841 |
|
Normally, patterns with duplicate names are such that in any one match, only |
| 1842 |
|
one of the named subpatterns participates. An example is shown in the |
| 1843 |
<a href="pcrepattern.html"><b>pcrepattern</b></a> |
<a href="pcrepattern.html"><b>pcrepattern</b></a> |
| 1844 |
documentation. |
documentation. |
| 1845 |
</P> |
</P> |
| 1895 |
just once, and does not backtrack. This has different characteristics to the |
just once, and does not backtrack. This has different characteristics to the |
| 1896 |
normal algorithm, and is not compatible with Perl. Some of the features of PCRE |
normal algorithm, and is not compatible with Perl. Some of the features of PCRE |
| 1897 |
patterns are not supported. Nevertheless, there are times when this kind of |
patterns are not supported. Nevertheless, there are times when this kind of |
| 1898 |
matching can be useful. For a discussion of the two matching algorithms, and a |
matching can be useful. For a discussion of the two matching algorithms, and a |
| 1899 |
list of features that <b>pcre_dfa_exec()</b> does not support, see the |
list of features that <b>pcre_dfa_exec()</b> does not support, see the |
| 1900 |
<a href="pcrematching.html"><b>pcrematching</b></a> |
<a href="pcrematching.html"><b>pcrematching</b></a> |
| 1901 |
documentation. |
documentation. |
| 1944 |
for <b>pcre_exec()</b>, so their description is not repeated here. |
for <b>pcre_exec()</b>, so their description is not repeated here. |
| 1945 |
<pre> |
<pre> |
| 1946 |
PCRE_PARTIAL_HARD |
PCRE_PARTIAL_HARD |
| 1947 |
PCRE_PARTIAL_SOFT |
PCRE_PARTIAL_SOFT |
| 1948 |
</pre> |
</pre> |
| 1949 |
These have the same general effect as they do for <b>pcre_exec()</b>, but the |
These have the same general effect as they do for <b>pcre_exec()</b>, but the |
| 1950 |
details are slightly different. When PCRE_PARTIAL_HARD is set for |
details are slightly different. When PCRE_PARTIAL_HARD is set for |
| 2067 |
</P> |
</P> |
| 2068 |
<br><a name="SEC22" href="#TOC1">REVISION</a><br> |
<br><a name="SEC22" href="#TOC1">REVISION</a><br> |
| 2069 |
<P> |
<P> |
| 2070 |
Last updated: 22 September 2009 |
Last updated: 03 October 2009 |
| 2071 |
<br> |
<br> |
| 2072 |
Copyright © 1997-2009 University of Cambridge. |
Copyright © 1997-2009 University of Cambridge. |
| 2073 |
<br> |
<br> |