| 430 |
<P> |
<P> |
| 431 |
Return information about the first character of any matched string, for a |
Return information about the first character of any matched string, for a |
| 432 |
non-anchored pattern. If there is a fixed first character, e.g. from a pattern |
non-anchored pattern. If there is a fixed first character, e.g. from a pattern |
| 433 |
such as (cat|cow|coyote), then it is returned in the integer pointed to by |
such as (cat|cow|coyote), it is returned in the integer pointed to by |
| 434 |
<I>where</I>. Otherwise, if either |
<I>where</I>. Otherwise, if either |
| 435 |
</P> |
</P> |
| 436 |
<P> |
<P> |
| 442 |
(if it were set, the pattern would be anchored), |
(if it were set, the pattern would be anchored), |
| 443 |
</P> |
</P> |
| 444 |
<P> |
<P> |
| 445 |
then -1 is returned, indicating that the pattern matches only at the |
-1 is returned, indicating that the pattern matches only at the start of a |
| 446 |
start of a subject string or after any "\n" within the string. Otherwise -2 is |
subject string or after any "\n" within the string. Otherwise -2 is returned. |
| 447 |
returned. For anchored patterns, -2 is returned. |
For anchored patterns, -2 is returned. |
| 448 |
</P> |
</P> |
| 449 |
<P> |
<P> |
| 450 |
<PRE> |
<PRE> |
| 734 |
were captured by the match, including the substring that matched the entire |
were captured by the match, including the substring that matched the entire |
| 735 |
regular expression. This is the value returned by <B>pcre_exec</B> if it |
regular expression. This is the value returned by <B>pcre_exec</B> if it |
| 736 |
is greater than zero. If <B>pcre_exec()</B> returned zero, indicating that it |
is greater than zero. If <B>pcre_exec()</B> returned zero, indicating that it |
| 737 |
ran out of space in <I>ovector</I>, then the value passed as |
ran out of space in <I>ovector</I>, the value passed as <I>stringcount</I> should |
| 738 |
<I>stringcount</I> should be the size of the vector divided by three. |
be the size of the vector divided by three. |
| 739 |
</P> |
</P> |
| 740 |
<P> |
<P> |
| 741 |
The functions <B>pcre_copy_substring()</B> and <B>pcre_get_substring()</B> |
The functions <B>pcre_copy_substring()</B> and <B>pcre_get_substring()</B> |
| 857 |
with the settings of captured strings when part of a pattern is repeated. For |
with the settings of captured strings when part of a pattern is repeated. For |
| 858 |
example, matching "aba" against the pattern /^(a(b)?)+$/ sets $2 to the value |
example, matching "aba" against the pattern /^(a(b)?)+$/ sets $2 to the value |
| 859 |
"b", but matching "aabbaa" against /^(aa(bb)?)+$/ leaves $2 unset. However, if |
"b", but matching "aabbaa" against /^(aa(bb)?)+$/ leaves $2 unset. However, if |
| 860 |
the pattern is changed to /^(aa(b(b))?)+$/ then $2 (and $3) get set. |
the pattern is changed to /^(aa(b(b))?)+$/ then $2 (and $3) are set. |
| 861 |
</P> |
</P> |
| 862 |
<P> |
<P> |
| 863 |
In Perl 5.004 $2 is set in both cases, and that is also true of PCRE. If in the |
In Perl 5.004 $2 is set in both cases, and that is also true of PCRE. If in the |
| 1186 |
<P> |
<P> |
| 1187 |
Outside a character class, a dot in the pattern matches any one character in |
Outside a character class, a dot in the pattern matches any one character in |
| 1188 |
the subject, including a non-printing character, but not (by default) newline. |
the subject, including a non-printing character, but not (by default) newline. |
| 1189 |
If the PCRE_DOTALL option is set, then dots match newlines as well. The |
If the PCRE_DOTALL option is set, dots match newlines as well. The handling of |
| 1190 |
handling of dot is entirely independent of the handling of circumflex and |
dot is entirely independent of the handling of circumflex and dollar, the only |
| 1191 |
dollar, the only relationship being that they both involve newline characters. |
relationship being that they both involve newline characters. Dot has no |
| 1192 |
Dot has no special meaning in a character class. |
special meaning in a character class. |
| 1193 |
</P> |
</P> |
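<P>
The dot behaviour described above can be sketched as follows. This is only an illustration written with Python's <B>re</B> module, not with PCRE itself; it assumes that <B>re.DOTALL</B> plays the role of PCRE_DOTALL, and the subject strings are invented for the example.
</P>

```python
import re

# Assumption: Python's re module stands in for PCRE here; re.DOTALL is
# taken to correspond to PCRE_DOTALL (Perl's /s).
subject = "a\nb"

# By default a dot does not match a newline.
default_match = re.match(r"a.b", subject)

# With DOTALL set, a dot matches a newline as well.
dotall_match = re.match(r"a.b", subject, re.DOTALL)

# Inside a character class a dot is an ordinary character.
class_match = re.match(r"a[.]b", "a.b")
class_miss = re.match(r"a[.]b", "axb")

print(default_match is None)
print(dotall_match is not None)
print(class_match is not None, class_miss is None)
```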
| 1194 |
<LI><A NAME="SEC17" HREF="#TOC1">SQUARE BRACKETS</A> |
<LI><A NAME="SEC17" HREF="#TOC1">SQUARE BRACKETS</A> |
| 1195 |
<P> |
<P> |
| 1580 |
item. |
item. |
| 1581 |
</P> |
</P> |
| 1582 |
<P> |
<P> |
| 1583 |
However, if a quantifier is followed by a question mark, then it ceases to be |
However, if a quantifier is followed by a question mark, it ceases to be |
| 1584 |
greedy, and instead matches the minimum number of times possible, so the |
greedy, and instead matches the minimum number of times possible, so the |
| 1585 |
pattern |
pattern |
| 1586 |
</P> |
</P> |
| 1605 |
way the rest of the pattern matches. |
way the rest of the pattern matches. |
| 1606 |
</P> |
</P> |
| 1607 |
<P> |
<P> |
| 1608 |
If the PCRE_UNGREEDY option is set (an option which is not available in Perl) |
If the PCRE_UNGREEDY option is set (an option which is not available in Perl), |
| 1609 |
then the quantifiers are not greedy by default, but individual ones can be made |
the quantifiers are not greedy by default, but individual ones can be made |
| 1610 |
greedy by following them with a question mark. In other words, it inverts the |
greedy by following them with a question mark. In other words, it inverts the |
| 1611 |
default behaviour. |
default behaviour. |
| 1612 |
</P> |
</P> |
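<P>
The difference between a greedy quantifier and one followed by a question mark can be sketched as follows. This is an illustration using Python's <B>re</B> module rather than PCRE; Python has no counterpart of PCRE_UNGREEDY, so only the per-quantifier form is shown, and the subject string is invented for the example.
</P>

```python
import re

# Assumption: Python's re module stands in for PCRE; it has no
# counterpart of PCRE_UNGREEDY, so only the per-quantifier "?" is shown.
subject = "/* first comment */  not comment  /* second comment */"

# Greedy: .* takes as much as it can, so one match spans both comments.
greedy = re.findall(r"/\*.*\*/", subject)

# Non-greedy: .*? takes as little as it can, so each comment matches alone.
lazy = re.findall(r"/\*.*?\*/", subject)

print(greedy)
print(lazy)
```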
| 1617 |
</P> |
</P> |
| 1618 |
<P> |
<P> |
| 1619 |
If a pattern starts with .* or .{0,} and the PCRE_DOTALL option (equivalent |
If a pattern starts with .* or .{0,} and the PCRE_DOTALL option (equivalent |
| 1620 |
to Perl's /s) is set, thus allowing the . to match newlines, then the pattern |
to Perl's /s) is set, thus allowing the . to match newlines, the pattern is |
| 1621 |
is implicitly anchored, because whatever follows will be tried against every |
implicitly anchored, because whatever follows will be tried against every |
| 1622 |
character position in the subject string, so there is no point in retrying the |
character position in the subject string, so there is no point in retrying the |
| 1623 |
overall match at any position after the first. PCRE treats such a pattern as |
overall match at any position after the first. PCRE treats such a pattern as |
| 1624 |
though it were preceded by \A. In cases where it is known that the subject |
though it were preceded by \A. In cases where it is known that the subject |
| 1677 |
<P> |
<P> |
| 1678 |
matches "sense and sensibility" and "response and responsibility", but not |
matches "sense and sensibility" and "response and responsibility", but not |
| 1679 |
"sense and responsibility". If caseful matching is in force at the time of the |
"sense and responsibility". If caseful matching is in force at the time of the |
| 1680 |
back reference, then the case of letters is relevant. For example, |
back reference, the case of letters is relevant. For example, |
| 1681 |
</P> |
</P> |
| 1682 |
<P> |
<P> |
| 1683 |
<PRE> |
<PRE> |
| 1690 |
</P> |
</P> |
| 1691 |
<P> |
<P> |
| 1692 |
There may be more than one back reference to the same subpattern. If a |
There may be more than one back reference to the same subpattern. If a |
| 1693 |
subpattern has not actually been used in a particular match, then any back |
subpattern has not actually been used in a particular match, any back |
| 1694 |
references to it always fail. For example, the pattern |
references to it always fail. For example, the pattern |
| 1695 |
</P> |
</P> |
| 1696 |
<P> |
<P> |
| 1702 |
always fails if it starts to match "a" rather than "bc". Because there may be |
always fails if it starts to match "a" rather than "bc". Because there may be |
| 1703 |
up to 99 back references, all digits following the backslash are taken |
up to 99 back references, all digits following the backslash are taken |
| 1704 |
as part of a potential back reference number. If the pattern continues with a |
as part of a potential back reference number. If the pattern continues with a |
| 1705 |
digit character, then some delimiter must be used to terminate the back |
digit character, some delimiter must be used to terminate the back reference. |
| 1706 |
reference. If the PCRE_EXTENDED option is set, this can be whitespace. |
If the PCRE_EXTENDED option is set, this can be whitespace. Otherwise an empty |
| 1707 |
Otherwise an empty comment can be used. |
comment can be used. |
| 1708 |
</P> |
</P> |
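<P>
The back reference rules above can be sketched as follows. This is an illustration using Python's <B>re</B> module rather than PCRE, on the assumption that it treats these particular constructs in the same way; the variable names and the last subject string are invented for the example.
</P>

```python
import re

# Assumption: Python's re module stands in for PCRE here.
pattern = re.compile(r"(sens|respons)e and \1ibility")

# The back reference must repeat the text the group actually matched.
same = pattern.fullmatch("sense and sensibility")
mixed = pattern.fullmatch("sense and responsibility")

# A back reference to a group that took no part in the match fails.
unset = re.compile(r"(a|(bc))\2")
starts_with_a = unset.match("aa")
starts_with_bc = unset.match("bcbc")

# An empty comment separates the back reference \1 from a literal digit 2.
delimited = re.fullmatch(r"(a)\1(?#)2", "aa2")

print(same is not None, mixed is None)
print(starts_with_a is None, starts_with_bc is not None)
print(delimited is not None)
```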
| 1709 |
<P> |
<P> |
| 1710 |
A back reference that occurs inside the parentheses to which it refers fails |
A back reference that occurs inside the parentheses to which it refers fails |
| 1836 |
matches "foo" preceded by three digits that are not "999". Notice that each of |
matches "foo" preceded by three digits that are not "999". Notice that each of |
| 1837 |
the assertions is applied independently at the same point in the subject |
the assertions is applied independently at the same point in the subject |
| 1838 |
string. First there is a check that the previous three characters are all |
string. First there is a check that the previous three characters are all |
| 1839 |
digits, then there is a check that the same three characters are not "999". |
digits, and then there is a check that the same three characters are not "999". |
| 1840 |
This pattern does <I>not</I> match "foo" preceded by six characters, the first |
This pattern does <I>not</I> match "foo" preceded by six characters, the first |
| 1841 |
three of which are digits and the last three of which are not "999". For example, it |
three of which are digits and the last three of which are not "999". For example, it |
| 1842 |
doesn't match "123abcfoo". A pattern to do that is |
doesn't match "123abcfoo". A pattern to do that is |
| 1957 |
</PRE> |
</PRE> |
| 1958 |
</P> |
</P> |
| 1959 |
<P> |
<P> |
| 1960 |
then the initial .* matches the entire string at first, but when this fails |
the initial .* matches the entire string at first, but when this fails (because |
| 1961 |
(because there is no following "a"), it backtracks to match all but the last |
there is no following "a"), it backtracks to match all but the last character, |
| 1962 |
character, then all but the last two characters, and so on. Once again the |
then all but the last two characters, and so on. Once again the search for "a" |
| 1963 |
search for "a" covers the entire string, from right to left, so we are no |
covers the entire string, from right to left, so we are no better off. However, |
| 1964 |
better off. However, if the pattern is written as |
if the pattern is written as |
| 1965 |
</P> |
</P> |
| 1966 |
<P> |
<P> |
| 1967 |
<PRE> |
<PRE> |
| 1969 |
</PRE> |
</PRE> |
| 1970 |
</P> |
</P> |
| 1971 |
<P> |
<P> |
| 1972 |
then there can be no backtracking for the .* item; it can match only the entire |
there can be no backtracking for the .* item; it can match only the entire |
| 1973 |
string. The subsequent lookbehind assertion does a single test on the last four |
string. The subsequent lookbehind assertion does a single test on the last four |
| 1974 |
characters. If it fails, the match fails immediately. For long strings, this |
characters. If it fails, the match fails immediately. For long strings, this |
| 1975 |
approach makes a significant difference to the processing time. |
approach makes a significant difference to the processing time. |
| 2032 |
</P> |
</P> |
| 2033 |
<P> |
<P> |
| 2034 |
There are two kinds of condition. If the text between the parentheses consists |
There are two kinds of condition. If the text between the parentheses consists |
| 2035 |
of a sequence of digits, then the condition is satisfied if the capturing |
of a sequence of digits, the condition is satisfied if the capturing subpattern |
| 2036 |
subpattern of that number has previously matched. Consider the following |
of that number has previously matched. Consider the following pattern, which |
| 2037 |
pattern, which contains non-significant white space to make it more readable |
contains non-significant white space to make it more readable (assume the |
| 2038 |
(assume the PCRE_EXTENDED option) and to divide it into three parts for ease |
PCRE_EXTENDED option) and to divide it into three parts for ease of discussion: |
|
of discussion: |
|
| 2039 |
</P> |
</P> |
| 2040 |
<P> |
<P> |
| 2041 |
<PRE> |
<PRE> |
| 2156 |
^ ^ |
^ ^ |
| 2157 |
^ ^ |
^ ^ |
| 2158 |
</PRE> |
</PRE> |
| 2159 |
then the string they capture is "ab(cd)ef", the contents of the top level |
the string they capture is "ab(cd)ef", the contents of the top level |
| 2160 |
parentheses. If there are more than 15 capturing parentheses in a pattern, PCRE |
parentheses. If there are more than 15 capturing parentheses in a pattern, PCRE |
| 2161 |
has to obtain extra memory to store data during a recursion, which it does by |
has to obtain extra memory to store data during a recursion, which it does by |
| 2162 |
using <B>pcre_malloc</B>, freeing it via <B>pcre_free</B> afterwards. If no |
using <B>pcre_malloc</B>, freeing it via <B>pcre_free</B> afterwards. If no |