| 1156 |
if compilation of a pattern fails, pcre_compile() returns NULL, and |
if compilation of a pattern fails, pcre_compile() returns NULL, and |
| 1157 |
sets the variable pointed to by errptr to point to a textual error mes- |
sets the variable pointed to by errptr to point to a textual error mes- |
| 1158 |
sage. This is a static string that is part of the library. You must not |
sage. This is a static string that is part of the library. You must not |
| 1159 |
try to free it. The offset from the start of the pattern to the charac- |
try to free it. The byte offset from the start of the pattern to the |
| 1160 |
ter where the error was discovered is placed in the variable pointed to |
character that was being processed when the error was discovered is |
| 1161 |
by erroffset, which must not be NULL. If it is, an immediate error is |
placed in the variable pointed to by erroffset, which must not be NULL. |
| 1162 |
given. |
If it is, an immediate error is given. Some errors are not detected |
| 1163 |
|
until checks are carried out when the whole pattern has been scanned; |
| 1164 |
|
in this case the offset is set to the end of the pattern. |
| 1165 |
|
|
| 1166 |
If pcre_compile2() is used instead of pcre_compile(), and the error- |
If pcre_compile2() is used instead of pcre_compile(), and the error- |
| 1167 |
codeptr argument is not NULL, a non-zero error code number is returned |
codeptr argument is not NULL, a non-zero error code number is returned |
| 2668 |
|
|
| 2669 |
REVISION |
REVISION |
| 2670 |
|
|
| 2671 |
Last updated: 11 September 2009 |
Last updated: 22 September 2009 |
| 2672 |
Copyright (c) 1997-2009 University of Cambridge. |
Copyright (c) 1997-2009 University of Cambridge. |
| 2673 |
------------------------------------------------------------------------------ |
------------------------------------------------------------------------------ |
| 2674 |
|
|
| 4485 |
|
|
| 4486 |
causes an error at compile time. Branches that match different length |
causes an error at compile time. Branches that match different length |
| 4487 |
strings are permitted only at the top level of a lookbehind assertion. |
strings are permitted only at the top level of a lookbehind assertion. |
| 4488 |
This is an extension compared with Perl (at least for 5.8), which |
This is an extension compared with Perl (5.8 and 5.10), which requires |
| 4489 |
requires all branches to match the same length of string. An assertion |
all branches to match the same length of string. An assertion such as |
|
such as |
|
| 4490 |
|
|
| 4491 |
(?<=ab(c|de)) |
(?<=ab(c|de)) |
| 4492 |
|
|
| 4493 |
is not permitted, because its single top-level branch can match two |
is not permitted, because its single top-level branch can match two |
| 4494 |
different lengths, but it is acceptable if rewritten to use two top- |
different lengths, but it is acceptable to PCRE if rewritten to use two |
| 4495 |
level branches: |
top-level branches: |
| 4496 |
|
|
| 4497 |
(?<=abc|abde) |
(?<=abc|abde) |
| 4498 |
|
|
| 4499 |
In some cases, the Perl 5.10 escape sequence \K (see above) can be used |
In some cases, the Perl 5.10 escape sequence \K (see above) can be used |
| 4500 |
instead of a lookbehind assertion; this is not restricted to a fixed- |
instead of a lookbehind assertion to get round the fixed-length |
| 4501 |
length. |
restriction. |
| 4502 |
|
|
| 4503 |
The implementation of lookbehind assertions is, for each alternative, |
The implementation of lookbehind assertions is, for each alternative, |
| 4504 |
to temporarily move the current position back by the fixed length and |
to temporarily move the current position back by the fixed length and |
| 4505 |
then try to match. If there are insufficient characters before the cur- |
then try to match. If there are insufficient characters before the cur- |
| 4506 |
rent position, the assertion fails. |
rent position, the assertion fails. |
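
The stepping-back behaviour described above can be sketched in plain Python. This is only an illustration of the idea, not PCRE's code: the function name lookbehind_succeeds is hypothetical, and the alternatives are assumed to be literal strings rather than real subpatterns.

```python
def lookbehind_succeeds(subject, pos, alternatives):
    """Sketch of a positive lookbehind over fixed-length alternatives.

    `alternatives` is a list of literal strings (an assumption made to keep
    the sketch small; in a real pattern each alternative is a subpattern).
    """
    for alt in alternatives:
        start = pos - len(alt)         # move back by this alternative's fixed length
        if start < 0:
            continue                   # insufficient characters: this alternative fails
        if subject[start:pos] == alt:  # then try to match forwards from there
            return True
    return False
```

With the alternatives "abc" and "abde", which correspond to (?<=abc|abde), each alternative is tried with its own fixed length, which is why different lengths are acceptable at the top level.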
| 4507 |
|
|
| 4508 |
PCRE does not allow the \C escape (which matches a single byte in UTF-8 |
PCRE does not allow the \C escape (which matches a single byte in UTF-8 |
| 4509 |
mode) to appear in lookbehind assertions, because it makes it impossi- |
mode) to appear in lookbehind assertions, because it makes it impossi- |
| 4510 |
ble to calculate the length of the lookbehind. The \X and \R escapes, |
ble to calculate the length of the lookbehind. The \X and \R escapes, |
| 4511 |
which can match different numbers of bytes, are also not permitted. |
which can match different numbers of bytes, are also not permitted. |
| 4512 |
|
|
| 4513 |
Possessive quantifiers can be used in conjunction with lookbehind |
"Subroutine" calls (see below) such as (?2) or (?&X) are permitted in |
| 4514 |
assertions to specify efficient matching at the end of the subject |
lookbehinds, as long as the subpattern matches a fixed-length string. |
| 4515 |
|
Recursion, however, is not supported. |
| 4516 |
|
|
| 4517 |
|
Possessive quantifiers can be used in conjunction with lookbehind |
| 4518 |
|
assertions to specify efficient matching at the end of the subject |
| 4519 |
string. Consider a simple pattern such as |
string. Consider a simple pattern such as |
| 4520 |
|
|
| 4521 |
abcd$ |
abcd$ |
| 4522 |
|
|
| 4523 |
when applied to a long string that does not match. Because matching |
when applied to a long string that does not match. Because matching |
| 4524 |
proceeds from left to right, PCRE will look for each "a" in the subject |
proceeds from left to right, PCRE will look for each "a" in the subject |
| 4525 |
and then see if what follows matches the rest of the pattern. If the |
and then see if what follows matches the rest of the pattern. If the |
| 4526 |
pattern is specified as |
pattern is specified as |
| 4527 |
|
|
| 4528 |
^.*abcd$ |
^.*abcd$ |
| 4529 |
|
|
| 4530 |
the initial .* matches the entire string at first, but when this fails |
the initial .* matches the entire string at first, but when this fails |
| 4531 |
(because there is no following "a"), it backtracks to match all but the |
(because there is no following "a"), it backtracks to match all but the |
| 4532 |
last character, then all but the last two characters, and so on. Once |
last character, then all but the last two characters, and so on. Once |
| 4533 |
again the search for "a" covers the entire string, from right to left, |
again the search for "a" covers the entire string, from right to left, |
| 4534 |
so we are no better off. However, if the pattern is written as |
so we are no better off. However, if the pattern is written as |
| 4535 |
|
|
| 4536 |
^.*+(?<=abcd) |
^.*+(?<=abcd) |
| 4537 |
|
|
| 4538 |
there can be no backtracking for the .*+ item; it can match only the |
there can be no backtracking for the .*+ item; it can match only the |
| 4539 |
entire string. The subsequent lookbehind assertion does a single test |
entire string. The subsequent lookbehind assertion does a single test |
| 4540 |
on the last four characters. If it fails, the match fails immediately. |
on the last four characters. If it fails, the match fails immediately. |
| 4541 |
For long strings, this approach makes a significant difference to the |
For long strings, this approach makes a significant difference to the |
| 4542 |
processing time. |
processing time. |
| 4543 |
|
|
| 4544 |
Using multiple assertions |
Using multiple assertions |
| 4547 |
|
|
| 4548 |
(?<=\d{3})(?<!999)foo |
(?<=\d{3})(?<!999)foo |
| 4549 |
|
|
| 4550 |
matches "foo" preceded by three digits that are not "999". Notice that |
matches "foo" preceded by three digits that are not "999". Notice that |
| 4551 |
each of the assertions is applied independently at the same point in |
each of the assertions is applied independently at the same point in |
| 4552 |
the subject string. First there is a check that the previous three |
the subject string. First there is a check that the previous three |
| 4553 |
characters are all digits, and then there is a check that the same |
characters are all digits, and then there is a check that the same |
| 4554 |
three characters are not "999". This pattern does not match "foo" |
three characters are not "999". This pattern does not match "foo" |
| 4555 |
preceded by six characters, the first three of which are digits and the |
preceded by six characters, the first three of which are digits and the |
| 4556 |
last three of which are not "999". For example, it doesn't match |
last three of which are not "999". For example, it doesn't match |
| 4557 |
"123abcfoo". A pattern to do that is |
"123abcfoo". A pattern to do that is |
| 4558 |
|
|
| 4559 |
(?<=\d{3}...)(?<!999)foo |
(?<=\d{3}...)(?<!999)foo |
| 4560 |
|
|
| 4561 |
This time the first assertion looks at the preceding six characters, |
This time the first assertion looks at the preceding six characters, |
| 4562 |
checking that the first three are digits, and then the second assertion |
checking that the first three are digits, and then the second assertion |
| 4563 |
checks that the preceding three characters are not "999". |
checks that the preceding three characters are not "999". |
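
The two patterns above can be tried with Python's standard re module. That module is a different engine from PCRE, but it accepts this particular lookbehind syntax, so the sketch below should only be read as an illustration of the semantics. The names three_digits_not_999, digits_then_three and matches are made up for the example.

```python
import re

# Both lookbehinds are tested at the same point, immediately before "foo".
three_digits_not_999 = re.compile(r"(?<=\d{3})(?<!999)foo")

# Here the first assertion inspects six characters: three digits, then any three.
digits_then_three = re.compile(r"(?<=\d{3}...)(?<!999)foo")


def matches(pattern, subject):
    """Return True if the compiled pattern matches anywhere in subject."""
    return pattern.search(subject) is not None
```

Under these assumptions, matches(three_digits_not_999, "123abcfoo") is false while matches(digits_then_three, "123abcfoo") is true.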
| 4564 |
|
|
| 4566 |
|
|
| 4567 |
(?<=(?<!foo)bar)baz |
(?<=(?<!foo)bar)baz |
| 4568 |
|
|
| 4569 |
matches an occurrence of "baz" that is preceded by "bar" which in turn |
matches an occurrence of "baz" that is preceded by "bar" which in turn |
| 4570 |
is not preceded by "foo", while |
is not preceded by "foo", while |
| 4571 |
|
|
| 4572 |
(?<=\d{3}(?!999)...)foo |
(?<=\d{3}(?!999)...)foo |
| 4573 |
|
|
| 4574 |
is another pattern that matches "foo" preceded by three digits and any |
is another pattern that matches "foo" preceded by three digits and any |
| 4575 |
three characters that are not "999". |
three characters that are not "999". |
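
As a rough check of the nested forms, the same patterns can be given to Python's re module, which is assumed here only because it happens to accept assertions nested inside a fixed-length lookbehind; it is not PCRE. The variable names are illustrative.

```python
import re

# "baz" preceded by "bar", where that "bar" is not itself preceded by "foo".
bar_not_after_foo = re.compile(r"(?<=(?<!foo)bar)baz")

# "foo" preceded by three digits and then three characters that are not "999".
digits_then_not_999 = re.compile(r"(?<=\d{3}(?!999)...)foo")
```

The inner assertions are evaluated while the outer lookbehind is being matched, so they apply at positions inside the six (or three) characters that the outer assertion inspects.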
| 4576 |
|
|
| 4577 |
|
|
| 4578 |
CONDITIONAL SUBPATTERNS |
CONDITIONAL SUBPATTERNS |
| 4579 |
|
|
| 4580 |
It is possible to cause the matching process to obey a subpattern con- |
It is possible to cause the matching process to obey a subpattern con- |
| 4581 |
ditionally or to choose between two alternative subpatterns, depending |
ditionally or to choose between two alternative subpatterns, depending |
| 4582 |
on the result of an assertion, or whether a previous capturing subpat- |
on the result of an assertion, or whether a previous capturing subpat- |
| 4583 |
tern matched or not. The two possible forms of conditional subpattern |
tern matched or not. The two possible forms of conditional subpattern |
| 4584 |
are |
are |
| 4585 |
|
|
| 4586 |
(?(condition)yes-pattern) |
(?(condition)yes-pattern) |
| 4587 |
(?(condition)yes-pattern|no-pattern) |
(?(condition)yes-pattern|no-pattern) |
| 4588 |
|
|
| 4589 |
If the condition is satisfied, the yes-pattern is used; otherwise the |
If the condition is satisfied, the yes-pattern is used; otherwise the |
| 4590 |
no-pattern (if present) is used. If there are more than two alterna- |
no-pattern (if present) is used. If there are more than two alterna- |
| 4591 |
tives in the subpattern, a compile-time error occurs. |
tives in the subpattern, a compile-time error occurs. |
| 4592 |
|
|
| 4593 |
There are four kinds of condition: references to subpatterns, refer- |
There are four kinds of condition: references to subpatterns, refer- |
| 4594 |
ences to recursion, a pseudo-condition called DEFINE, and assertions. |
ences to recursion, a pseudo-condition called DEFINE, and assertions. |
| 4595 |
|
|
| 4596 |
Checking for a used subpattern by number |
Checking for a used subpattern by number |
| 4597 |
|
|
| 4598 |
If the text between the parentheses consists of a sequence of digits, |
If the text between the parentheses consists of a sequence of digits, |
| 4599 |
the condition is true if the capturing subpattern of that number has |
the condition is true if the capturing subpattern of that number has |
| 4600 |
previously matched. An alternative notation is to precede the digits |
previously matched. An alternative notation is to precede the digits |
| 4601 |
with a plus or minus sign. In this case, the subpattern number is rela- |
with a plus or minus sign. In this case, the subpattern number is rela- |
| 4602 |
tive rather than absolute. The most recently opened parentheses can be |
tive rather than absolute. The most recently opened parentheses can be |
| 4603 |
referenced by (?(-1), the next most recent by (?(-2), and so on. In |
referenced by (?(-1), the next most recent by (?(-2), and so on. In |
| 4604 |
looping constructs it can also make sense to refer to subsequent groups |
looping constructs it can also make sense to refer to subsequent groups |
| 4605 |
with constructs such as (?(+2). |
with constructs such as (?(+2). |
| 4606 |
|
|
| 4607 |
Consider the following pattern, which contains non-significant white |
Consider the following pattern, which contains non-significant white |
| 4608 |
space to make it more readable (assume the PCRE_EXTENDED option) and to |
space to make it more readable (assume the PCRE_EXTENDED option) and to |
| 4609 |
divide it into three parts for ease of discussion: |
divide it into three parts for ease of discussion: |
| 4610 |
|
|
| 4611 |
( \( )? [^()]+ (?(1) \) ) |
( \( )? [^()]+ (?(1) \) ) |
| 4612 |
|
|
| 4613 |
The first part matches an optional opening parenthesis, and if that |
The first part matches an optional opening parenthesis, and if that |
| 4614 |
character is present, sets it as the first captured substring. The sec- |
character is present, sets it as the first captured substring. The sec- |
| 4615 |
ond part matches one or more characters that are not parentheses. The |
ond part matches one or more characters that are not parentheses. The |
| 4616 |
third part is a conditional subpattern that tests whether the first set |
third part is a conditional subpattern that tests whether the first set |
| 4617 |
of parentheses matched or not. If they did, that is, if the subject started |
of parentheses matched or not. If they did, that is, if the subject started |
| 4618 |
with an opening parenthesis, the condition is true, and so the yes-pat- |
with an opening parenthesis, the condition is true, and so the yes-pat- |
| 4619 |
tern is executed and a closing parenthesis is required. Otherwise, |
tern is executed and a closing parenthesis is required. Otherwise, |
| 4620 |
since no-pattern is not present, the subpattern matches nothing. In |
since no-pattern is not present, the subpattern matches nothing. In |
| 4621 |
other words, this pattern matches a sequence of non-parentheses, |
other words, this pattern matches a sequence of non-parentheses, |
| 4622 |
optionally enclosed in parentheses. |
optionally enclosed in parentheses. |
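
The absolute-number form of this condition is also accepted by Python's re module, so the example can be tried there as a sketch. Python's re.VERBOSE flag stands in for PCRE_EXTENDED; the engine is not PCRE, and relative references such as (?(-1) are not assumed to work in it. The name optional_parens is illustrative.

```python
import re

# White space is ignored because of re.VERBOSE (standing in for PCRE_EXTENDED).
optional_parens = re.compile(r"( \( )? [^()]+ (?(1) \) )", re.VERBOSE)


def fully_matches(subject):
    """Return True if the whole subject is matched by the pattern."""
    return optional_parens.fullmatch(subject) is not None
```

A subject such as "(abc" is rejected: once group 1 has matched the opening parenthesis, the condition is true and the closing parenthesis becomes mandatory.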
| 4623 |
|
|
| 4624 |
If you were embedding this pattern in a larger one, you could use a |
If you were embedding this pattern in a larger one, you could use a |
| 4625 |
relative reference: |
relative reference: |
| 4626 |
|
|
| 4627 |
...other stuff... ( \( )? [^()]+ (?(-1) \) ) ... |
...other stuff... ( \( )? [^()]+ (?(-1) \) ) ... |
| 4628 |
|
|
| 4629 |
This makes the fragment independent of the parentheses in the larger |
This makes the fragment independent of the parentheses in the larger |
| 4630 |
pattern. |
pattern. |
| 4631 |
|
|
| 4632 |
Checking for a used subpattern by name |
Checking for a used subpattern by name |
| 4633 |
|
|
| 4634 |
Perl uses the syntax (?(<name>)...) or (?('name')...) to test for a |
Perl uses the syntax (?(<name>)...) or (?('name')...) to test for a |
| 4635 |
used subpattern by name. For compatibility with earlier versions of |
used subpattern by name. For compatibility with earlier versions of |
| 4636 |
PCRE, which had this facility before Perl, the syntax (?(name)...) is |
PCRE, which had this facility before Perl, the syntax (?(name)...) is |
| 4637 |
also recognized. However, there is a possible ambiguity with this syn- |
also recognized. However, there is a possible ambiguity with this syn- |
| 4638 |
tax, because subpattern names may consist entirely of digits. PCRE |
tax, because subpattern names may consist entirely of digits. PCRE |
| 4639 |
looks first for a named subpattern; if it cannot find one and the name |
looks first for a named subpattern; if it cannot find one and the name |
| 4640 |
consists entirely of digits, PCRE looks for a subpattern of that num- |
consists entirely of digits, PCRE looks for a subpattern of that num- |
| 4641 |
ber, which must be greater than zero. Using subpattern names that con- |
ber, which must be greater than zero. Using subpattern names that con- |
| 4642 |
sist entirely of digits is not recommended. |
sist entirely of digits is not recommended. |
| 4643 |
|
|
| 4644 |
Rewriting the above example to use a named subpattern gives this: |
Rewriting the above example to use a named subpattern gives this: |
| 4649 |
Checking for pattern recursion |
Checking for pattern recursion |
| 4650 |
|
|
| 4651 |
If the condition is the string (R), and there is no subpattern with the |
If the condition is the string (R), and there is no subpattern with the |
| 4652 |
name R, the condition is true if a recursive call to the whole pattern |
name R, the condition is true if a recursive call to the whole pattern |
| 4653 |
or any subpattern has been made. If digits or a name preceded by amper- |
or any subpattern has been made. If digits or a name preceded by amper- |
| 4654 |
sand follow the letter R, for example: |
sand follow the letter R, for example: |
| 4655 |
|
|
| 4656 |
(?(R3)...) or (?(R&name)...) |
(?(R3)...) or (?(R&name)...) |
| 4657 |
|
|
| 4658 |
the condition is true if the most recent recursion is into the subpat- |
the condition is true if the most recent recursion is into the subpat- |
| 4659 |
tern whose number or name is given. This condition does not check the |
tern whose number or name is given. This condition does not check the |
| 4660 |
entire recursion stack. |
entire recursion stack. |
| 4661 |
|
|
| 4662 |
At "top level", all these recursion test conditions are false. Recur- |
At "top level", all these recursion test conditions are false. Recur- |
| 4663 |
sive patterns are described below. |
sive patterns are described below. |
| 4664 |
|
|
| 4665 |
Defining subpatterns for use by reference only |
Defining subpatterns for use by reference only |
| 4666 |
|
|
| 4667 |
If the condition is the string (DEFINE), and there is no subpattern |
If the condition is the string (DEFINE), and there is no subpattern |
| 4668 |
with the name DEFINE, the condition is always false. In this case, |
with the name DEFINE, the condition is always false. In this case, |
| 4669 |
there may be only one alternative in the subpattern. It is always |
there may be only one alternative in the subpattern. It is always |
| 4670 |
skipped if control reaches this point in the pattern; the idea of |
skipped if control reaches this point in the pattern; the idea of |
| 4671 |
DEFINE is that it can be used to define "subroutines" that can be ref- |
DEFINE is that it can be used to define "subroutines" that can be ref- |
| 4672 |
erenced from elsewhere. (The use of "subroutines" is described below.) |
erenced from elsewhere. (The use of "subroutines" is described below.) |
| 4673 |
For example, a pattern to match an IPv4 address could be written like |
For example, a pattern to match an IPv4 address could be written like |
| 4674 |
this (ignore whitespace and line breaks): |
this (ignore whitespace and line breaks): |
| 4675 |
|
|
| 4676 |
(?(DEFINE) (?<byte> 2[0-4]\d | 25[0-5] | 1\d\d | [1-9]?\d) ) |
(?(DEFINE) (?<byte> 2[0-4]\d | 25[0-5] | 1\d\d | [1-9]?\d) ) |
| 4677 |
\b (?&byte) (\.(?&byte)){3} \b |
\b (?&byte) (\.(?&byte)){3} \b |
| 4678 |
|
|
| 4679 |
The first part of the pattern is a DEFINE group inside which another |
The first part of the pattern is a DEFINE group inside which another |
| 4680 |
group named "byte" is defined. This matches an individual component of |
group named "byte" is defined. This matches an individual component of |
| 4681 |
an IPv4 address (a number less than 256). When matching takes place, |
an IPv4 address (a number less than 256). When matching takes place, |
| 4682 |
this part of the pattern is skipped because DEFINE acts like a false |
this part of the pattern is skipped because DEFINE acts like a false |
| 4683 |
condition. |
condition. |
| 4684 |
|
|
| 4685 |
The rest of the pattern uses references to the named group to match the |
The rest of the pattern uses references to the named group to match the |
| 4686 |
four dot-separated components of an IPv4 address, insisting on a word |
four dot-separated components of an IPv4 address, insisting on a word |
| 4687 |
boundary at each end. |
boundary at each end. |
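
Python's re module has no DEFINE condition and no (?&name) calls, so the sketch below only imitates the idea by building the pattern text from a reusable fragment. It is a loose analogue, not the PCRE mechanism; the names BYTE and IPV4 are hypothetical.

```python
import re

# The "byte" subpattern from the example, kept as a reusable text fragment.
BYTE = r"(?:2[0-4]\d|25[0-5]|1\d\d|[1-9]?\d)"

# One byte, then three more, each preceded by a dot, with word boundaries.
IPV4 = re.compile(rf"\b{BYTE}(?:\.{BYTE}){{3}}\b")
```

The difference from DEFINE is that the fragment is expanded four times in the compiled pattern here, whereas a "subroutine" call refers back to a single definition.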
| 4688 |
|
|
| 4689 |
Assertion conditions |
Assertion conditions |
| 4690 |
|
|
| 4691 |
If the condition is not in any of the above formats, it must be an |
If the condition is not in any of the above formats, it must be an |
| 4692 |
assertion. This may be a positive or negative lookahead or lookbehind |
assertion. This may be a positive or negative lookahead or lookbehind |
| 4693 |
assertion. Consider this pattern, again containing non-significant |
assertion. Consider this pattern, again containing non-significant |
| 4694 |
white space, and with the two alternatives on the second line: |
white space, and with the two alternatives on the second line: |
| 4695 |
|
|
| 4696 |
(?(?=[^a-z]*[a-z]) |
(?(?=[^a-z]*[a-z]) |
| 4697 |
\d{2}-[a-z]{3}-\d{2} | \d{2}-\d{2}-\d{2} ) |
\d{2}-[a-z]{3}-\d{2} | \d{2}-\d{2}-\d{2} ) |
| 4698 |
|
|
| 4699 |
The condition is a positive lookahead assertion that matches an |
The condition is a positive lookahead assertion that matches an |
| 4700 |
optional sequence of non-letters followed by a letter. In other words, |
optional sequence of non-letters followed by a letter. In other words, |
| 4701 |
it tests for the presence of at least one letter in the subject. If a |
it tests for the presence of at least one letter in the subject. If a |
| 4702 |
letter is found, the subject is matched against the first alternative; |
letter is found, the subject is matched against the first alternative; |
| 4703 |
otherwise it is matched against the second. This pattern matches |
otherwise it is matched against the second. This pattern matches |
| 4704 |
strings in one of the two forms dd-aaa-dd or dd-dd-dd, where aaa are |
strings in one of the two forms dd-aaa-dd or dd-dd-dd, where aaa are |
| 4705 |
letters and dd are digits. |
letters and dd are digits. |
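
Python's re module does not accept an assertion as a condition, so the following sketch splits the work into two steps to show the logic: test the lookahead first, then pick the alternative. It assumes the pattern is applied at the start of the subject and must cover all of it; the function name match_date_like is made up.

```python
import re

CONDITION = re.compile(r"[^a-z]*[a-z]")          # body of the lookahead
WITH_LETTERS = re.compile(r"\d{2}-[a-z]{3}-\d{2}")  # yes-pattern
DIGITS_ONLY = re.compile(r"\d{2}-\d{2}-\d{2}")      # no-pattern


def match_date_like(subject):
    """Choose an alternative according to whether a letter lies ahead."""
    if CONDITION.match(subject):          # the assertion consumes nothing
        return WITH_LETTERS.fullmatch(subject)
    return DIGITS_ONLY.fullmatch(subject)
```

Note that once the condition has chosen the first alternative, the second is never tried, so a subject such as "12-34-5a" fails outright.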
| 4706 |
|
|
| 4707 |
|
|
| 4708 |
COMMENTS |
COMMENTS |
| 4709 |
|
|
| 4710 |
The sequence (?# marks the start of a comment that continues up to the |
The sequence (?# marks the start of a comment that continues up to the |
| 4711 |
next closing parenthesis. Nested parentheses are not permitted. The |
next closing parenthesis. Nested parentheses are not permitted. The |
| 4712 |
characters that make up a comment play no part in the pattern matching |
characters that make up a comment play no part in the pattern matching |
| 4713 |
at all. |
at all. |
| 4714 |
|
|
| 4715 |
If the PCRE_EXTENDED option is set, an unescaped # character outside a |
If the PCRE_EXTENDED option is set, an unescaped # character outside a |
| 4716 |
character class introduces a comment that continues to immediately |
character class introduces a comment that continues to immediately |
| 4717 |
after the next newline in the pattern. |
after the next newline in the pattern. |
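
Both comment forms are also understood by Python's re module, which is used below purely as a convenient stand-in; re.VERBOSE plays the part of PCRE_EXTENDED. The variable names are illustrative.

```python
import re

# (?# ... ) comment: everything up to the next closing parenthesis is ignored.
inline_comment = re.compile(r"ab(?#this text plays no part in matching)cd")

# In extended mode, an unescaped # outside a class starts a comment
# that runs to the next newline in the pattern.
extended_comment = re.compile("ab  # ignored up to the newline\ncd", re.VERBOSE)

# Inside a character class, # is an ordinary character.
hash_in_class = re.compile(r"[#]x", re.VERBOSE)
```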
| 4718 |
|
|
| 4719 |
|
|
| 4720 |
RECURSIVE PATTERNS |
RECURSIVE PATTERNS |
| 4721 |
|
|
| 4722 |
Consider the problem of matching a string in parentheses, allowing for |
Consider the problem of matching a string in parentheses, allowing for |
| 4723 |
unlimited nested parentheses. Without the use of recursion, the best |
unlimited nested parentheses. Without the use of recursion, the best |
| 4724 |
that can be done is to use a pattern that matches up to some fixed |
that can be done is to use a pattern that matches up to some fixed |
| 4725 |
depth of nesting. It is not possible to handle an arbitrary nesting |
depth of nesting. It is not possible to handle an arbitrary nesting |
| 4726 |
depth. |
depth. |
| 4727 |
|
|
| 4728 |
For some time, Perl has provided a facility that allows regular expres- |
For some time, Perl has provided a facility that allows regular expres- |
| 4729 |
sions to recurse (amongst other things). It does this by interpolating |
sions to recurse (amongst other things). It does this by interpolating |
| 4730 |
Perl code in the expression at run time, and the code can refer to the |
Perl code in the expression at run time, and the code can refer to the |
| 4731 |
expression itself. A Perl pattern using code interpolation to solve the |
expression itself. A Perl pattern using code interpolation to solve the |
| 4732 |
parentheses problem can be created like this: |
parentheses problem can be created like this: |
| 4733 |
|
|
| 4737 |
refers recursively to the pattern in which it appears. |
refers recursively to the pattern in which it appears. |
| 4738 |
|
|
| 4739 |
Obviously, PCRE cannot support the interpolation of Perl code. Instead, |
Obviously, PCRE cannot support the interpolation of Perl code. Instead, |
| 4740 |
it supports special syntax for recursion of the entire pattern, and |
it supports special syntax for recursion of the entire pattern, and |
| 4741 |
also for individual subpattern recursion. After its introduction in |
also for individual subpattern recursion. After its introduction in |
| 4742 |
PCRE and Python, this kind of recursion was subsequently introduced |
PCRE and Python, this kind of recursion was subsequently introduced |
| 4743 |
into Perl at release 5.10. |
into Perl at release 5.10. |
| 4744 |
|
|
| 4745 |
A special item that consists of (? followed by a number greater than |
A special item that consists of (? followed by a number greater than |
| 4746 |
zero and a closing parenthesis is a recursive call of the subpattern of |
zero and a closing parenthesis is a recursive call of the subpattern of |
| 4747 |
the given number, provided that it occurs inside that subpattern. (If |
the given number, provided that it occurs inside that subpattern. (If |
| 4748 |
not, it is a "subroutine" call, which is described in the next sec- |
not, it is a "subroutine" call, which is described in the next sec- |
| 4749 |
tion.) The special item (?R) or (?0) is a recursive call of the entire |
tion.) The special item (?R) or (?0) is a recursive call of the entire |
| 4750 |
regular expression. |
regular expression. |
| 4751 |
|
|
| 4752 |
This PCRE pattern solves the nested parentheses problem (assume the |
This PCRE pattern solves the nested parentheses problem (assume the |
| 4753 |
PCRE_EXTENDED option is set so that white space is ignored): |
PCRE_EXTENDED option is set so that white space is ignored): |
| 4754 |
|
|
| 4755 |
\( ( (?>[^()]+) | (?R) )* \) |
\( ( (?>[^()]+) | (?R) )* \) |
| 4756 |
|
|
| 4757 |
First it matches an opening parenthesis. Then it matches any number of |
First it matches an opening parenthesis. Then it matches any number of |
| 4758 |
substrings which can either be a sequence of non-parentheses, or a |
substrings which can either be a sequence of non-parentheses, or a |
| 4759 |
recursive match of the pattern itself (that is, a correctly parenthe- |
recursive match of the pattern itself (that is, a correctly parenthe- |
| 4760 |
sized substring). Finally there is a closing parenthesis. |
sized substring). Finally there is a closing parenthesis. |
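
The structure of this pattern can be mirrored by a small hand-written recursive function. The sketch below is plain Python and does not use any regular expression engine; it is meant only to show what the recursion does. The name match_parens is hypothetical.

```python
def match_parens(subject, start=0):
    """Return the index just past the balanced group at `start`, or -1.

    Mirrors \\( ( (?>[^()]+) | (?R) )* \\) applied at position `start`.
    """
    if start >= len(subject) or subject[start] != "(":
        return -1                      # no opening parenthesis here
    pos = start + 1
    while pos < len(subject):
        ch = subject[pos]
        if ch == ")":
            return pos + 1             # the closing parenthesis
        if ch == "(":
            pos = match_parens(subject, pos)   # recursive match, like (?R)
            if pos == -1:
                return -1
        else:
            # A run of non-parentheses, consumed in one go like (?>[^()]+).
            while pos < len(subject) and subject[pos] not in "()":
                pos += 1
    return -1                          # ran out of subject before closing
```

For the subject "(ab(cd)ef)" the function consumes all ten characters, and for an unbalanced subject such as "(a(b)" it reports failure.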
| 4761 |
|
|
| 4762 |
If this were part of a larger pattern, you would not want to recurse |
If this were part of a larger pattern, you would not want to recurse |
| 4763 |
the entire pattern, so instead you could use this: |
the entire pattern, so instead you could use this: |
| 4764 |
|
|
| 4765 |
( \( ( (?>[^()]+) | (?1) )* \) ) |
( \( ( (?>[^()]+) | (?1) )* \) ) |
| 4766 |
|
|
| 4767 |
We have put the pattern into parentheses, and caused the recursion to |
We have put the pattern into parentheses, and caused the recursion to |
| 4768 |
refer to them instead of the whole pattern. |
refer to them instead of the whole pattern. |
| 4769 |
|
|
| 4770 |
In a larger pattern, keeping track of parenthesis numbers can be |
In a larger pattern, keeping track of parenthesis numbers can be |
| 4771 |
tricky. This is made easier by the use of relative references. (A Perl |
tricky. This is made easier by the use of relative references. (A Perl |
| 4772 |
5.10 feature.) Instead of (?1) in the pattern above you can write |
5.10 feature.) Instead of (?1) in the pattern above you can write |
| 4773 |
(?-2) to refer to the second most recently opened parentheses preceding |
(?-2) to refer to the second most recently opened parentheses preceding |
| 4774 |
the recursion. In other words, a negative number counts capturing |
the recursion. In other words, a negative number counts capturing |
| 4775 |
parentheses leftwards from the point at which it is encountered. |
parentheses leftwards from the point at which it is encountered. |
| 4776 |
|
|
| 4777 |
It is also possible to refer to subsequently opened parentheses, by |
It is also possible to refer to subsequently opened parentheses, by |
| 4778 |
writing references such as (?+2). However, these cannot be recursive |
writing references such as (?+2). However, these cannot be recursive |
| 4779 |
because the reference is not inside the parentheses that are refer- |
because the reference is not inside the parentheses that are refer- |
| 4780 |
enced. They are always "subroutine" calls, as described in the next |
enced. They are always "subroutine" calls, as described in the next |
| 4781 |
section. |
section. |
| 4782 |
|
|
| 4783 |
An alternative approach is to use named parentheses instead. The Perl |
An alternative approach is to use named parentheses instead. The Perl |
| 4784 |
syntax for this is (?&name); PCRE's earlier syntax (?P>name) is also |
syntax for this is (?&name); PCRE's earlier syntax (?P>name) is also |
| 4785 |
supported. We could rewrite the above example as follows: |
supported. We could rewrite the above example as follows: |
| 4786 |
|
|
| 4787 |
(?<pn> \( ( (?>[^()]+) | (?&pn) )* \) ) |
(?<pn> \( ( (?>[^()]+) | (?&pn) )* \) ) |
| 4788 |
|
|
| 4789 |
If there is more than one subpattern with the same name, the earliest |
If there is more than one subpattern with the same name, the earliest |
| 4790 |
one is used. |
one is used. |
| 4791 |
|
|
| 4792 |
This particular example pattern that we have been looking at contains |
This particular example pattern that we have been looking at contains |
| 4793 |
nested unlimited repeats, and so the use of atomic grouping for match- |
nested unlimited repeats, and so the use of atomic grouping for match- |
| 4794 |
ing strings of non-parentheses is important when applying the pattern |
ing strings of non-parentheses is important when applying the pattern |
| 4795 |
to strings that do not match. For example, when this pattern is applied |
to strings that do not match. For example, when this pattern is applied |
| 4796 |
to |
to |
| 4797 |
|
|
| 4798 |
(aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa() |
(aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa() |
| 4799 |
|
|
| 4800 |
it yields "no match" quickly. However, if atomic grouping is not used, |
it yields "no match" quickly. However, if atomic grouping is not used, |
| 4801 |
the match runs for a very long time indeed because there are so many |
the match runs for a very long time indeed because there are so many |
| 4802 |
different ways the + and * repeats can carve up the subject, and all |
different ways the + and * repeats can carve up the subject, and all |
| 4803 |
have to be tested before failure can be reported. |
have to be tested before failure can be reported. |
| 4804 |
|
|
| 4805 |
At the end of a match, the values set for any capturing subpatterns are |
At the end of a match, the values set for any capturing subpatterns are |
| 4806 |
those from the outermost level of the recursion at which the subpattern |
those from the outermost level of the recursion at which the subpattern |
| 4807 |
value is set. If you want to obtain intermediate values, a callout |
value is set. If you want to obtain intermediate values, a callout |
| 4808 |
function can be used (see below and the pcrecallout documentation). If |
function can be used (see below and the pcrecallout documentation). If |
| 4809 |
the pattern above is matched against |
the pattern above is matched against |
| 4810 |
|
|
| 4811 |
(ab(cd)ef) |
(ab(cd)ef) |
| 4812 |
|
|
| 4813 |
the value for the capturing parentheses is "ef", which is the last |
the value for the capturing parentheses is "ef", which is the last |
| 4814 |
value taken on at the top level. If additional parentheses are added, |
value taken on at the top level. If additional parentheses are added, |
| 4815 |
giving |
giving |
| 4816 |
|
|
| 4817 |
\( ( ( (?>[^()]+) | (?R) )* ) \) |
\( ( ( (?>[^()]+) | (?R) )* ) \) |
| 4818 |
^ ^ |
^ ^ |
| 4819 |
^ ^ |
^ ^ |
| 4820 |
|
|
| 4821 |
the string they capture is "ab(cd)ef", the contents of the top level |
the string they capture is "ab(cd)ef", the contents of the top level |
| 4822 |
parentheses. If there are more than 15 capturing parentheses in a pat- |
parentheses. If there are more than 15 capturing parentheses in a pat- |
| 4823 |
tern, PCRE has to obtain extra memory to store data during a recursion, |
tern, PCRE has to obtain extra memory to store data during a recursion, |
| 4824 |
which it does by using pcre_malloc, freeing it via pcre_free after- |
which it does by using pcre_malloc, freeing it via pcre_free after- |
| 4825 |
wards. If no memory can be obtained, the match fails with the |
wards. If no memory can be obtained, the match fails with the |
| 4826 |
PCRE_ERROR_NOMEMORY error. |
PCRE_ERROR_NOMEMORY error. |
| 4827 |
|
|
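The capture behaviour for the (ab(cd)ef) example can be mimicked with a hand-written matcher. The Python sketch below is only a model of the recursive pattern, not PCRE's implementation; match_nested is a hypothetical helper that returns the end offset together with the last value that the repeated group took at the level where it was called. Values set inside a recursive call are deliberately thrown away, which is the behaviour described above.

```python
def match_nested(subject, pos=0):
    """Return (end, last_group_value) for a balanced group starting at pos, or None."""
    # Models:  \( ( (?>[^()]+) | (?R) )* \)
    if pos >= len(subject) or subject[pos] != "(":
        return None
    i = pos + 1
    group1 = None  # value of the repeated capturing group at this level
    while i < len(subject):
        ch = subject[i]
        if ch == ")":
            return i + 1, group1
        if ch == "(":
            inner = match_nested(subject, i)  # plays the part of (?R)
            if inner is None:
                return None
            end, _discarded = inner  # captures made inside the recursion are dropped
            group1 = subject[i:end]
            i = end
        else:
            j = i
            while j < len(subject) and subject[j] not in "()":
                j += 1
            group1 = subject[i:j]  # one atomic run of non-parentheses
            i = j
    return None  # ran out of subject before the closing parenthesis


if __name__ == "__main__":
    print(match_nested("(ab(cd)ef)"))
```

Because the matcher never revisits a run of non-parentheses once it has been consumed, it also fails quickly on unbalanced input, in the same spirit as the atomic group in the pattern.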
| 4828 |
Do not confuse the (?R) item with the condition (R), which tests for |
Do not confuse the (?R) item with the condition (R), which tests for |
| 4829 |
recursion. Consider this pattern, which matches text in angle brack- |
recursion. Consider this pattern, which matches text in angle brack- |
| 4830 |
ets, allowing for arbitrary nesting. Only digits are allowed in nested |
ets, allowing for arbitrary nesting. Only digits are allowed in nested |
| 4831 |
brackets (that is, when recursing), whereas any characters are permit- |
brackets (that is, when recursing), whereas any characters are permit- |
| 4832 |
ted at the outer level. |
ted at the outer level. |
| 4833 |
|
|
| 4834 |
< (?: (?(R) \d++ | [^<>]*+) | (?R)) * > |
< (?: (?(R) \d++ | [^<>]*+) | (?R)) * > |
| 4835 |
|
|
| 4836 |
In this pattern, (?(R) is the start of a conditional subpattern, with |
In this pattern, (?(R) is the start of a conditional subpattern, with |
| 4837 |
two different alternatives for the recursive and non-recursive cases. |
two different alternatives for the recursive and non-recursive cases. |
| 4838 |
The (?R) item is the actual recursive call. |
The (?R) item is the actual recursive call. |
| 4839 |
|
|
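The difference between the (R) condition and the (?R) call can be seen in a hand-written equivalent. This Python sketch is a model written for illustration, not output from PCRE; match_angle and its recursing flag are hypothetical names. The flag plays the part of the (R) test, and the nested function call plays the part of (?R).

```python
def match_angle(subject, pos=0, recursing=False):
    """Return the end offset of a bracketed group starting at pos, or None."""
    # Models:  < (?: (?(R) \d++ | [^<>]*+) | (?R)) * >
    if pos >= len(subject) or subject[pos] != "<":
        return None
    i = pos + 1
    while i < len(subject):
        ch = subject[i]
        if ch == ">":
            return i + 1
        if ch == "<":
            end = match_angle(subject, i, recursing=True)  # the (?R) call
            if end is None:
                return None
            i = end
        elif recursing and ch not in "0123456789":
            return None  # the (R) condition is true: only digits are allowed here
        else:
            i += 1  # outer level: any character other than < or >
    return None


if __name__ == "__main__":
    print(match_angle("<abc<123>def>"))
    print(match_angle("<abc<12x>def>"))
```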
| 4840 |
Recursion difference from Perl |
Recursion difference from Perl |
| 4841 |
|
|
| 4842 |
In PCRE (like Python, but unlike Perl), a recursive subpattern call is |
In PCRE (like Python, but unlike Perl), a recursive subpattern call is |
| 4843 |
always treated as an atomic group. That is, once it has matched some of |
always treated as an atomic group. That is, once it has matched some of |
| 4844 |
the subject string, it is never re-entered, even if it contains untried |
the subject string, it is never re-entered, even if it contains untried |
| 4845 |
alternatives and there is a subsequent matching failure. This can be |
alternatives and there is a subsequent matching failure. This can be |
| 4846 |
illustrated by the following pattern, which purports to match a palin- |
illustrated by the following pattern, which purports to match a palin- |
| 4847 |
dromic string that contains an odd number of characters (for example, |
dromic string that contains an odd number of characters (for example, |
| 4848 |
"a", "aba", "abcba", "abcdcba"): |
"a", "aba", "abcba", "abcdcba"): |
| 4849 |
|
|
| 4850 |
^(.|(.)(?1)\2)$ |
^(.|(.)(?1)\2)$ |
| 4851 |
|
|
| 4852 |
The idea is that it either matches a single character, or two identical |
The idea is that it either matches a single character, or two identical |
| 4853 |
characters surrounding a sub-palindrome. In Perl, this pattern works; |
characters surrounding a sub-palindrome. In Perl, this pattern works; |
| 4854 |
in PCRE it does not if the subject is longer than three characters. |
in PCRE it does not if the subject is longer than three characters. |
| 4855 |
Consider the subject string "abcba": |
Consider the subject string "abcba": |
| 4856 |
|
|
| 4857 |
At the top level, the first character is matched, but as it is not at |
At the top level, the first character is matched, but as it is not at |
| 4858 |
the end of the string, the first alternative fails; the second alterna- |
the end of the string, the first alternative fails; the second alterna- |
| 4859 |
tive is taken and the recursion kicks in. The recursive call to subpat- |
tive is taken and the recursion kicks in. The recursive call to subpat- |
| 4860 |
tern 1 successfully matches the next character ("b"). (Note that the |
tern 1 successfully matches the next character ("b"). (Note that the |
| 4861 |
beginning and end of line tests are not part of the recursion.) |
beginning and end of line tests are not part of the recursion.) |
| 4862 |
|
|
| 4863 |
Back at the top level, the next character ("c") is compared with what |
Back at the top level, the next character ("c") is compared with what |
| 4864 |
subpattern 2 matched, which was "a". This fails. Because the recursion |
subpattern 2 matched, which was "a". This fails. Because the recursion |
| 4865 |
is treated as an atomic group, there are now no backtracking points, |
is treated as an atomic group, there are now no backtracking points, |
| 4866 |
and so the entire match fails. (Perl is able, at this point, to re- |
and so the entire match fails. (Perl is able, at this point, to re- |
| 4867 |
enter the recursion and try the second alternative.) However, if the |
enter the recursion and try the second alternative.) However, if the |
| 4868 |
pattern is written with the alternatives in the other order, things are |
pattern is written with the alternatives in the other order, things are |
| 4869 |
different: |
different: |
| 4870 |
|
|
| 4871 |
^((.)(?1)\2|.)$ |
^((.)(?1)\2|.)$ |
| 4872 |
|
|
| 4873 |
This time, the recursing alternative is tried first, and continues to |
This time, the recursing alternative is tried first, and continues to |
| 4874 |
recurse until it runs out of characters, at which point the recursion |
recurse until it runs out of characters, at which point the recursion |
| 4875 |
fails. But this time we do have another alternative to try at the |
fails. But this time we do have another alternative to try at the |
| 4876 |
higher level. That is the big difference: in the previous case the |
higher level. That is the big difference: in the previous case the |
| 4877 |
remaining alternative is at a deeper recursion level, which PCRE cannot |
remaining alternative is at a deeper recursion level, which PCRE cannot |
| 4878 |
use. |
use. |
| 4879 |
|
|
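The effect of atomic recursion on the two orderings can be reproduced with a small backtracking model. The Python sketch below is not PCRE or Perl; group_ends and matches are hypothetical helpers, and the alternatives are named "single" (the lone dot) and "wrap" (a character, a recursive call, and a back reference to that character). With atomic=True a recursive call keeps only the first way it matched, which models the PCRE behaviour described above; with atomic=False every way is tried, which models Perl.

```python
from itertools import islice


def group_ends(subject, start, order, atomic):
    """Yield each end offset at which group 1 can finish when entered at start."""
    for alternative in order:
        if start >= len(subject):
            continue  # both alternatives need at least one character
        if alternative == "single":
            yield start + 1  # the "." alternative
        else:
            # The "(.)(?1)\2" alternative: subject[start] is what group 2 captures.
            inner = group_ends(subject, start + 1, order, atomic)
            if atomic:
                inner = islice(inner, 1)  # never re-enter the call after it has matched
            for end in inner:
                if end < len(subject) and subject[end] == subject[start]:
                    yield end + 1  # the back reference matched


def matches(subject, order, atomic):
    """Model ^( ... )$ : group 1 must end exactly at the end of the subject."""
    return any(end == len(subject) for end in group_ends(subject, 0, order, atomic))


if __name__ == "__main__":
    print(matches("abcba", ["single", "wrap"], atomic=True))
    print(matches("abcba", ["single", "wrap"], atomic=False))
    print(matches("abcba", ["wrap", "single"], atomic=True))
```

Only recursive calls are made atomic in this model; the group at the top level can still move on to its second alternative, which is why reordering the alternatives helps.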
| 4880 |
To change the pattern so that it matches all palindromic strings, not just |
To change the pattern so that it matches all palindromic strings, not just |
| 4881 |
those with an odd number of characters, it is tempting to change the |
those with an odd number of characters, it is tempting to change the |
| 4882 |
pattern to this: |
pattern to this: |
| 4883 |
|
|
| 4884 |
^((.)(?1)\2|.?)$ |
^((.)(?1)\2|.?)$ |
| 4885 |
|
|
| 4886 |
Again, this works in Perl, but not in PCRE, and for the same reason. |
Again, this works in Perl, but not in PCRE, and for the same reason. |
| 4887 |
When a deeper recursion has matched a single character, it cannot be |
When a deeper recursion has matched a single character, it cannot be |
| 4888 |
entered again in order to match an empty string. The solution is to |
entered again in order to match an empty string. The solution is to |
| 4889 |
separate the two cases, and write out the odd and even cases as alter- |
separate the two cases, and write out the odd and even cases as alter- |
| 4890 |
natives at the higher level: |
natives at the higher level: |
| 4891 |
|
|
| 4892 |
^(?:((.)(?1)\2|)|((.)(?3)\4|.)) |
^(?:((.)(?1)\2|)|((.)(?3)\4|.)) |
| 4893 |
|
|
| 4894 |
If you want to match typical palindromic phrases, the pattern has to |
If you want to match typical palindromic phrases, the pattern has to |
| 4895 |
ignore all non-word characters, which can be done like this: |
ignore all non-word characters, which can be done like this: |
| 4896 |
|
|
| 4897 |
^\W*+(?:((.)\W*+(?1)\W*+\2|)|((.)\W*+(?3)\W*+\4|\W*+.\W*+))\W*+$ |
^\W*+(?:((.)\W*+(?1)\W*+\2|)|((.)\W*+(?3)\W*+\4|\W*+.\W*+))\W*+$ |
| 4898 |
|
|
| 4899 |
If run with the PCRE_CASELESS option, this pattern matches phrases such |
If run with the PCRE_CASELESS option, this pattern matches phrases such |
| 4900 |
as "A man, a plan, a canal: Panama!" and it works well in both PCRE and |
as "A man, a plan, a canal: Panama!" and it works well in both PCRE and |
| 4901 |
Perl. Note the use of the possessive quantifier *+ to avoid backtrack- |
Perl. Note the use of the possessive quantifier *+ to avoid backtrack- |
| 4902 |
ing into sequences of non-word characters. Without this, PCRE takes a |
ing into sequences of non-word characters. Without this, PCRE takes a |
| 4903 |
great deal longer (ten times or more) to match typical phrases, and |
great deal longer (ten times or more) to match typical phrases, and |
| 4904 |
Perl takes so long that you think it has gone into a loop. |
Perl takes so long that you think it has gone into a loop. |
| 4905 |
|
|
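When testing a pattern like this, a plain reference check is useful for deciding which phrases ought to match. The Python sketch below does not use recursion at all; is_palindromic_phrase is a hypothetical helper that removes non-word characters, folds case (standing in for PCRE_CASELESS), and compares the result with its reverse.

```python
import re


def is_palindromic_phrase(text):
    """Reference check: ignore non-word characters and case, then compare with the reverse."""
    # re.ASCII keeps \W close to PCRE's default (non-UTF-8) meaning.
    letters = re.sub(r"\W+", "", text, flags=re.ASCII).lower()
    return letters == letters[::-1]


if __name__ == "__main__":
    print(is_palindromic_phrase("A man, a plan, a canal: Panama!"))
    print(is_palindromic_phrase("A man, a plan, a canal: Suez!"))
```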
| 4906 |
|
|
| 4907 |
SUBPATTERNS AS SUBROUTINES |
SUBPATTERNS AS SUBROUTINES |
| 4908 |
|
|
| 4909 |
If the syntax for a recursive subpattern reference (either by number or |
If the syntax for a recursive subpattern reference (either by number or |
| 4910 |
by name) is used outside the parentheses to which it refers, it oper- |
by name) is used outside the parentheses to which it refers, it oper- |
| 4911 |
ates like a subroutine in a programming language. The "called" subpat- |
ates like a subroutine in a programming language. The "called" subpat- |
| 4912 |
tern may be defined before or after the reference. A numbered reference |
tern may be defined before or after the reference. A numbered reference |
| 4913 |
can be absolute or relative, as in these examples: |
can be absolute or relative, as in these examples: |
| 4914 |
|
|
| 4920 |
|
|
| 4921 |
(sens|respons)e and \1ibility |
(sens|respons)e and \1ibility |
| 4922 |
|
|
| 4923 |
matches "sense and sensibility" and "response and responsibility", but |
matches "sense and sensibility" and "response and responsibility", but |
| 4924 |
not "sense and responsibility". If instead the pattern |
not "sense and responsibility". If instead the pattern |
| 4925 |
|
|
| 4926 |
(sens|respons)e and (?1)ibility |
(sens|respons)e and (?1)ibility |
| 4927 |
|
|
| 4928 |
is used, it does match "sense and responsibility" as well as the other |
is used, it does match "sense and responsibility" as well as the other |
| 4929 |
two strings. Another example is given in the discussion of DEFINE |
two strings. Another example is given in the discussion of DEFINE |
| 4930 |
above. |
above. |
| 4931 |
|
|
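The contrast between a back reference and a subroutine call can be tried out in Python, with one caveat: Python's re module has no (?1) syntax, so the sketch below emulates the call by repeating the text of the group. That emulation is an assumption made for illustration, not something PCRE does; the names BACKREF and SUBROUTINE_LIKE are hypothetical.

```python
import re

# A back reference must match the same text that group 1 matched.
BACKREF = re.compile(r"(sens|respons)e and \1ibility")

# Emulation of (sens|respons)e and (?1)ibility: the group's pattern is
# applied again, so either alternative may match the second time.
SUBROUTINE_LIKE = re.compile(r"(sens|respons)e and (?:sens|respons)ibility")

if __name__ == "__main__":
    for subject in (
        "sense and sensibility",
        "response and responsibility",
        "sense and responsibility",
    ):
        print(
            subject,
            bool(BACKREF.fullmatch(subject)),
            bool(SUBROUTINE_LIKE.fullmatch(subject)),
        )
```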
| 4932 |
Like recursive subpatterns, a "subroutine" call is always treated as an |
Like recursive subpatterns, a "subroutine" call is always treated as an |
| 4933 |
atomic group. That is, once it has matched some of the subject string, |
atomic group. That is, once it has matched some of the subject string, |
| 4934 |
it is never re-entered, even if it contains untried alternatives and |
it is never re-entered, even if it contains untried alternatives and |
| 4935 |
there is a subsequent matching failure. |
there is a subsequent matching failure. |
| 4936 |
|
|
| 4937 |
When a subpattern is used as a subroutine, processing options such as |
When a subpattern is used as a subroutine, processing options such as |
| 4938 |
case-independence are fixed when the subpattern is defined. They cannot |
case-independence are fixed when the subpattern is defined. They cannot |
| 4939 |
be changed for different calls. For example, consider this pattern: |
be changed for different calls. For example, consider this pattern: |
| 4940 |
|
|
| 4941 |
(abc)(?i:(?-1)) |
(abc)(?i:(?-1)) |
| 4942 |
|
|
| 4943 |
It matches "abcabc". It does not match "abcABC" because the change of |
It matches "abcabc". It does not match "abcABC" because the change of |
| 4944 |
processing option does not affect the called subpattern. |
processing option does not affect the called subpattern. |
| 4945 |
|
|
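The same behaviour can be imitated in Python, again by emulation because the re module has no subroutine calls. In the sketch below the called subpattern is written out a second time inside (?-i:...), which pins it to the case-sensitive setting under which the original group was defined; this spelling is an assumption chosen to mimic PCRE, and the names are hypothetical.

```python
import re

# Emulates (abc)(?i:(?-1)): the "called" text keeps the options it was
# defined with, so the enclosing (?i:...) has no effect on it.
FIXED_OPTIONS = re.compile(r"(abc)(?i:(?-i:abc))")

# For contrast: literal text inside (?i:...) does become caseless.
CASELESS_TEXT = re.compile(r"(abc)(?i:abc)")

if __name__ == "__main__":
    for subject in ("abcabc", "abcABC"):
        print(
            subject,
            bool(FIXED_OPTIONS.fullmatch(subject)),
            bool(CASELESS_TEXT.fullmatch(subject)),
        )
```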
| 4946 |
|
|
| 4947 |
ONIGURUMA SUBROUTINE SYNTAX |
ONIGURUMA SUBROUTINE SYNTAX |
| 4948 |
|
|
| 4949 |
For compatibility with Oniguruma, the non-Perl syntax \g followed by a |
For compatibility with Oniguruma, the non-Perl syntax \g followed by a |
| 4950 |
name or a number enclosed in either angle brackets or single quotes is |
name or a number enclosed in either angle brackets or single quotes is |
| 4951 |
an alternative syntax for referencing a subpattern as a subroutine, |
an alternative syntax for referencing a subpattern as a subroutine, |
| 4952 |
possibly recursively. Here are two of the examples used above, rewrit- |
possibly recursively. Here are two of the examples used above, rewrit- |
| 4953 |
ten using this syntax: |
ten using this syntax: |
| 4954 |
|
|
| 4955 |
(?<pn> \( ( (?>[^()]+) | \g<pn> )* \) ) |
(?<pn> \( ( (?>[^()]+) | \g<pn> )* \) ) |
| 4956 |
(sens|respons)e and \g'1'ibility |
(sens|respons)e and \g'1'ibility |
| 4957 |
|
|
| 4958 |
PCRE supports an extension to Oniguruma: if a number is preceded by a |
PCRE supports an extension to Oniguruma: if a number is preceded by a |
| 4959 |
plus or a minus sign it is taken as a relative reference. For example: |
plus or a minus sign it is taken as a relative reference. For example: |
| 4960 |
|
|
| 4961 |
(abc)(?i:\g<-1>) |
(abc)(?i:\g<-1>) |
| 4962 |
|
|
| 4963 |
Note that \g{...} (Perl syntax) and \g<...> (Oniguruma syntax) are not |
Note that \g{...} (Perl syntax) and \g<...> (Oniguruma syntax) are not |
| 4964 |
synonymous. The former is a back reference; the latter is a subroutine |
synonymous. The former is a back reference; the latter is a subroutine |
| 4965 |
call. |
call. |
| 4966 |
|
|
| 4967 |
|
|
| 4968 |
CALLOUTS |
CALLOUTS |
| 4969 |
|
|
| 4970 |
Perl has a feature whereby using the sequence (?{...}) causes arbitrary |
Perl has a feature whereby using the sequence (?{...}) causes arbitrary |
| 4971 |
Perl code to be obeyed in the middle of matching a regular expression. |
Perl code to be obeyed in the middle of matching a regular expression. |
| 4972 |
This makes it possible, amongst other things, to extract different sub- |
This makes it possible, amongst other things, to extract different sub- |
| 4973 |
strings that match the same pair of parentheses when there is a repeti- |
strings that match the same pair of parentheses when there is a repeti- |
| 4974 |
tion. |
tion. |
| 4975 |
|
|
| 4976 |
PCRE provides a similar feature, but of course it cannot obey arbitrary |
PCRE provides a similar feature, but of course it cannot obey arbitrary |
| 4977 |
Perl code. The feature is called "callout". The caller of PCRE provides |
Perl code. The feature is called "callout". The caller of PCRE provides |
| 4978 |
an external function by putting its entry point in the global variable |
an external function by putting its entry point in the global variable |
| 4979 |
pcre_callout. By default, this variable contains NULL, which disables |
pcre_callout. By default, this variable contains NULL, which disables |
| 4980 |
all calling out. |
all calling out. |
| 4981 |
|
|
| 4982 |
Within a regular expression, (?C) indicates the points at which the |
Within a regular expression, (?C) indicates the points at which the |
| 4983 |
external function is to be called. If you want to identify different |
external function is to be called. If you want to identify different |
| 4984 |
callout points, you can put a number less than 256 after the letter C. |
callout points, you can put a number less than 256 after the letter C. |
| 4985 |
The default value is zero. For example, this pattern has two callout |
The default value is zero. For example, this pattern has two callout |
| 4986 |
points: |
points: |
| 4987 |
|
|
| 4988 |
(?C1)abc(?C2)def |
(?C1)abc(?C2)def |
| 4989 |
|
|
| 4990 |
If the PCRE_AUTO_CALLOUT flag is passed to pcre_compile(), callouts are |
If the PCRE_AUTO_CALLOUT flag is passed to pcre_compile(), callouts are |
| 4991 |
automatically installed before each item in the pattern. They are all |
automatically installed before each item in the pattern. They are all |
| 4992 |
numbered 255. |
numbered 255. |
| 4993 |
|
|
| 4994 |
During matching, when PCRE reaches a callout point (and pcre_callout is |
During matching, when PCRE reaches a callout point (and pcre_callout is |
| 4995 |
set), the external function is called. It is provided with the number |
set), the external function is called. It is provided with the number |
| 4996 |
of the callout, the position in the pattern, and, optionally, one item |
of the callout, the position in the pattern, and, optionally, one item |
| 4997 |
of data originally supplied by the caller of pcre_exec(). The callout |
of data originally supplied by the caller of pcre_exec(). The callout |
| 4998 |
function may cause matching to proceed, to backtrack, or to fail alto- |
function may cause matching to proceed, to backtrack, or to fail alto- |
| 4999 |
gether. A complete description of the interface to the callout function |
gether. A complete description of the interface to the callout function |
| 5000 |
is given in the pcrecallout documentation. |
is given in the pcrecallout documentation. |
| 5001 |
|
|
| 5002 |
|
|
| 5003 |
BACKTRACKING CONTROL |
BACKTRACKING CONTROL |
| 5004 |
|
|
| 5005 |
Perl 5.10 introduced a number of "Special Backtracking Control Verbs", |
Perl 5.10 introduced a number of "Special Backtracking Control Verbs", |
| 5006 |
which are described in the Perl documentation as "experimental and sub- |
which are described in the Perl documentation as "experimental and sub- |
| 5007 |
ject to change or removal in a future version of Perl". It goes on to |
ject to change or removal in a future version of Perl". It goes on to |
| 5008 |
say: "Their usage in production code should be noted to avoid problems |
say: "Their usage in production code should be noted to avoid problems |
| 5009 |
during upgrades." The same remarks apply to the PCRE features described |
during upgrades." The same remarks apply to the PCRE features described |
| 5010 |
in this section. |
in this section. |
| 5011 |
|
|
| 5012 |
Since these verbs are specifically related to backtracking, most of |
Since these verbs are specifically related to backtracking, most of |
| 5013 |
them can be used only when the pattern is to be matched using |
them can be used only when the pattern is to be matched using |
| 5014 |
pcre_exec(), which uses a backtracking algorithm. With the exception of |
pcre_exec(), which uses a backtracking algorithm. With the exception of |
| 5015 |
(*FAIL), which behaves like a failing negative assertion, they cause an |
(*FAIL), which behaves like a failing negative assertion, they cause an |
| 5016 |
error if encountered by pcre_dfa_exec(). |
error if encountered by pcre_dfa_exec(). |
| 5017 |
|
|
| 5018 |
If any of these verbs are used in an assertion subpattern, their effect |
If any of these verbs are used in an assertion subpattern, their effect |
| 5019 |
is confined to that subpattern; it does not extend to the surrounding |
is confined to that subpattern; it does not extend to the surrounding |
| 5020 |
pattern. Note that assertion subpatterns are processed as anchored at |
pattern. Note that assertion subpatterns are processed as anchored at |
| 5021 |
the point where they are tested. |
the point where they are tested. |
| 5022 |
|
|
| 5023 |
The new verbs make use of what was previously invalid syntax: an open- |
The new verbs make use of what was previously invalid syntax: an open- |
| 5024 |
ing parenthesis followed by an asterisk. In Perl, they are generally of |
ing parenthesis followed by an asterisk. In Perl, they are generally of |
| 5025 |
the form (*VERB:ARG) but PCRE does not support the use of arguments, so |
the form (*VERB:ARG) but PCRE does not support the use of arguments, so |
| 5026 |
its general form is just (*VERB). Any number of these verbs may occur |
its general form is just (*VERB). Any number of these verbs may occur |
| 5027 |
in a pattern. There are two kinds: |
in a pattern. There are two kinds: |
| 5028 |
|
|
| 5029 |
Verbs that act immediately |
Verbs that act immediately |
| 5032 |
|
|
| 5033 |
(*ACCEPT) |
(*ACCEPT) |
| 5034 |
|
|
| 5035 |
This verb causes the match to end successfully, skipping the remainder |
This verb causes the match to end successfully, skipping the remainder |
| 5036 |
of the pattern. When inside a recursion, only the innermost pattern is |
of the pattern. When inside a recursion, only the innermost pattern is |
| 5037 |
ended immediately. If the (*ACCEPT) is inside capturing parentheses, |
ended immediately. If the (*ACCEPT) is inside capturing parentheses, |
| 5038 |
the data so far is captured. (This feature was added to PCRE at release |
the data so far is captured. (This feature was added to PCRE at release |
| 5039 |
8.00.) For example: |
8.00.) For example: |
| 5040 |
|
|
| 5041 |
A((?:A|B(*ACCEPT)|C)D) |
A((?:A|B(*ACCEPT)|C)D) |
| 5042 |
|
|
| 5043 |
This matches "AB", "AAD", or "ACD"; when it matches "AB", "B" is cap- |
This matches "AB", "AAD", or "ACD"; when it matches "AB", "B" is cap- |
| 5044 |
tured by the outer parentheses. |
tured by the outer parentheses. |
| 5045 |
|
|
| 5046 |
(*FAIL) or (*F) |
(*FAIL) or (*F) |
| 5047 |
|
|
| 5048 |
This verb causes the match to fail, forcing backtracking to occur. It |
This verb causes the match to fail, forcing backtracking to occur. It |
| 5049 |
is equivalent to (?!) but easier to read. The Perl documentation notes |
is equivalent to (?!) but easier to read. The Perl documentation notes |
| 5050 |
that it is probably useful only when combined with (?{}) or (??{}). |
that it is probably useful only when combined with (?{}) or (??{}). |
| 5051 |
Those are, of course, Perl features that are not present in PCRE. The |
Those are, of course, Perl features that are not present in PCRE. The |
| 5052 |
nearest equivalent is the callout feature, as for example in this pat- |
nearest equivalent is the callout feature, as for example in this pat- |
| 5053 |
tern: |
tern: |
| 5054 |
|
|
| 5055 |
a+(?C)(*FAIL) |
a+(?C)(*FAIL) |
| 5056 |
|
|
| 5057 |
A match with the string "aaaa" always fails, but the callout is taken |
A match with the string "aaaa" always fails, but the callout is taken |
| 5058 |
before each backtrack happens (in this example, 10 times). |
before each backtrack happens (in this example, 10 times). |
| 5059 |
|
|
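The figure of 10 can be checked by listing the states in which the callout is reached. The Python sketch below is a simulation of an unanchored backtracking match of a+(?C)(*FAIL), not a use of the PCRE callout interface; callout_positions is a hypothetical helper, and the sketch assumes that no start-of-match optimization skips any starting offset.

```python
def callout_positions(subject):
    """List (start, current) offsets at which the callout after a+ would be reached."""
    calls = []
    for start in range(len(subject)):  # the normal bumpalong over start offsets
        run = 0
        while start + run < len(subject) and subject[start + run] == "a":
            run += 1
        # a+ is greedy: it tries the longest run first, then gives back one
        # character at a time each time (*FAIL) forces a backtrack.
        for length in range(run, 0, -1):
            calls.append((start, start + length))
    return calls


if __name__ == "__main__":
    calls = callout_positions("aaaa")
    print(len(calls))
    print(calls)
```

For "aaaa" the four starting offsets contribute 4, 3, 2 and 1 callouts respectively.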
| 5060 |
Verbs that act after backtracking |
Verbs that act after backtracking |
| 5061 |
|
|
| 5062 |
The following verbs do nothing when they are encountered. Matching con- |
The following verbs do nothing when they are encountered. Matching con- |
| 5063 |
tinues with what follows, but if there is no subsequent match, a fail- |
tinues with what follows, but if there is no subsequent match, a fail- |
| 5064 |
ure is forced. The verbs differ in exactly what kind of failure |
ure is forced. The verbs differ in exactly what kind of failure |
| 5065 |
occurs. |
occurs. |
| 5066 |
|
|
| 5067 |
(*COMMIT) |
(*COMMIT) |
| 5068 |
|
|
| 5069 |
This verb causes the whole match to fail outright if the rest of the |
This verb causes the whole match to fail outright if the rest of the |
| 5070 |
pattern does not match. Even if the pattern is unanchored, no further |
pattern does not match. Even if the pattern is unanchored, no further |
| 5071 |
attempts to find a match by advancing the start point take place. Once |
attempts to find a match by advancing the start point take place. Once |
| 5072 |
(*COMMIT) has been passed, pcre_exec() is committed to finding a match |
(*COMMIT) has been passed, pcre_exec() is committed to finding a match |
| 5073 |
at the current starting point, or not at all. For example: |
at the current starting point, or not at all. For example: |
| 5074 |
|
|
| 5075 |
a+(*COMMIT)b |
a+(*COMMIT)b |
| 5076 |
|
|
| 5077 |
This matches "xxaab" but not "aacaab". It can be thought of as a kind |
This matches "xxaab" but not "aacaab". It can be thought of as a kind |
| 5078 |
of dynamic anchor, or "I've started, so I must finish." |
of dynamic anchor, or "I've started, so I must finish." |
| 5079 |
|
|
| 5080 |
(*PRUNE) |
(*PRUNE) |
| 5081 |
|
|
| 5082 |
This verb causes the match to fail at the current position if the rest |
This verb causes the match to fail at the current position if the rest |
| 5083 |
of the pattern does not match. If the pattern is unanchored, the normal |
of the pattern does not match. If the pattern is unanchored, the normal |
| 5084 |
"bumpalong" advance to the next starting character then happens. Back- |
"bumpalong" advance to the next starting character then happens. Back- |
| 5085 |
tracking can occur as usual to the left of (*PRUNE), or when matching |
tracking can occur as usual to the left of (*PRUNE), or when matching |
| 5086 |
to the right of (*PRUNE), but if there is no match to the right, back- |
to the right of (*PRUNE), but if there is no match to the right, back- |
| 5087 |
tracking cannot cross (*PRUNE). In simple cases, the use of (*PRUNE) |
tracking cannot cross (*PRUNE). In simple cases, the use of (*PRUNE) |
| 5088 |
is just an alternative to an atomic group or possessive quantifier, but |
is just an alternative to an atomic group or possessive quantifier, but |
| 5089 |
there are some uses of (*PRUNE) that cannot be expressed in any other |
there are some uses of (*PRUNE) that cannot be expressed in any other |
| 5090 |
way. |
way. |
| 5091 |
|
|
| 5092 |
(*SKIP) |
(*SKIP) |
| 5093 |
|
|
| 5094 |
This verb is like (*PRUNE), except that if the pattern is unanchored, |
This verb is like (*PRUNE), except that if the pattern is unanchored, |
| 5095 |
the "bumpalong" advance is not to the next character, but to the posi- |
the "bumpalong" advance is not to the next character, but to the posi- |
| 5096 |
tion in the subject where (*SKIP) was encountered. (*SKIP) signifies |
tion in the subject where (*SKIP) was encountered. (*SKIP) signifies |
| 5097 |
that whatever text was matched leading up to it cannot be part of a |
that whatever text was matched leading up to it cannot be part of a |
| 5098 |
successful match. Consider: |
successful match. Consider: |
| 5099 |
|
|
| 5100 |
a+(*SKIP)b |
a+(*SKIP)b |
| 5101 |
|
|
| 5102 |
If the subject is "aaaac...", after the first match attempt fails |
If the subject is "aaaac...", after the first match attempt fails |
| 5103 |
(starting at the first character in the string), the starting point |
(starting at the first character in the string), the starting point |
| 5104 |
skips on to start the next attempt at "c". Note that a possessive quan- |
skips on to start the next attempt at "c". Note that a possessive quan- |
| 5105 |
tifier does not have the same effect in this example; although it would |
tifier does not have the same effect in this example; although it would |
| 5106 |
suppress backtracking during the first match attempt, the second |
suppress backtracking during the first match attempt, the second |
| 5107 |
attempt would start at the second character instead of skipping on to |
attempt would start at the second character instead of skipping on to |
| 5108 |
"c". |
"c". |
| 5109 |
|
|
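The three verbs described so far differ only in what happens to the starting offset after a failure, which a small simulation makes visible. The Python sketch below models the single pattern shape a+(*VERB)b by hand; it does not call PCRE, the helper name search is hypothetical, and start-of-match optimizations are ignored. It returns the match span (or None) and the list of starting offsets that were tried.

```python
def search(subject, verb=None):
    """Simulate an unanchored match of a+(*VERB)b; verb is None, "COMMIT", "PRUNE" or "SKIP"."""
    tried = []
    start = 0
    while start <= len(subject):
        tried.append(start)
        end = start
        while end < len(subject) and subject[end] == "a":
            end += 1
        if end > start:  # a+ matched, so the verb has been passed
            if end < len(subject) and subject[end] == "b":
                return (start, end + 1), tried
            # Without a verb, shorter runs of "a" are followed by "a", never "b",
            # so backtracking into a+ cannot help; the verbs forbid it anyway.
            if verb == "COMMIT":
                return None, tried  # no further starting offsets are tried
            if verb == "SKIP":
                start = end  # resume where (*SKIP) was encountered
                continue
        start += 1  # normal bumpalong (also what (*PRUNE) allows)
    return None, tried


if __name__ == "__main__":
    print(search("xxaab", "COMMIT"))
    print(search("aacaab", "COMMIT"))
    print(search("aaaac", "PRUNE"))
    print(search("aaaac", "SKIP"))
```

For this particular pattern (*PRUNE) tries the same starting offsets as no verb at all; the difference from (*SKIP) is that the latter jumps straight to the "c".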
| 5110 |
(*THEN) |
(*THEN) |
| 5111 |
|
|
| 5112 |
This verb causes a skip to the next alternative if the rest of the pat- |
This verb causes a skip to the next alternative if the rest of the pat- |
| 5113 |
tern does not match. That is, it cancels pending backtracking, but only |
tern does not match. That is, it cancels pending backtracking, but only |
| 5114 |
within the current alternation. Its name comes from the observation |
within the current alternation. Its name comes from the observation |
| 5115 |
that it can be used for a pattern-based if-then-else block: |
that it can be used for a pattern-based if-then-else block: |
| 5116 |
|
|
| 5117 |
( COND1 (*THEN) FOO | COND2 (*THEN) BAR | COND3 (*THEN) BAZ ) ... |
( COND1 (*THEN) FOO | COND2 (*THEN) BAR | COND3 (*THEN) BAZ ) ... |
| 5118 |
|
|
| 5119 |
If the COND1 pattern matches, FOO is tried (and possibly further items |
If the COND1 pattern matches, FOO is tried (and possibly further items |
| 5120 |
after the end of the group if FOO succeeds); on failure the matcher |
after the end of the group if FOO succeeds); on failure the matcher |
| 5121 |
skips to the second alternative and tries COND2, without backtracking |
skips to the second alternative and tries COND2, without backtracking |
| 5122 |
into COND1. If (*THEN) is used outside of any alternation, it acts |
into COND1. If (*THEN) is used outside of any alternation, it acts |
| 5123 |
exactly like (*PRUNE). |
exactly like (*PRUNE). |
| 5124 |
|
|
| 5125 |
|
|
| 5137 |
|
|
| 5138 |
REVISION |
REVISION |
| 5139 |
|
|
| 5140 |
Last updated: 18 September 2009 |
Last updated: 22 September 2009 |
| 5141 |
Copyright (c) 1997-2009 University of Cambridge. |
Copyright (c) 1997-2009 University of Cambridge. |
| 5142 |
------------------------------------------------------------------------------ |
------------------------------------------------------------------------------ |
| 5143 |
|
|