| 3246 |
\n linefeed (hex 0A) |
\n linefeed (hex 0A) |
| 3247 |
\r carriage return (hex 0D) |
\r carriage return (hex 0D) |
| 3248 |
\t tab (hex 09) |
\t tab (hex 09) |
| 3249 |
\ddd character with octal code ddd, or backreference |
\ddd character with octal code ddd, or back reference |
| 3250 |
\xhh character with hex code hh |
\xhh character with hex code hh |
| 3251 |
\x{hhh..} character with hex code hhh.. |
\x{hhh..} character with hex code hhh.. |
| 3252 |
|
|
| 4051 |
/ ( a ) (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x |
/ ( a ) (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x |
| 4052 |
# 1 2 2 3 2 3 4 |
# 1 2 2 3 2 3 4 |
| 4053 |
|
|
| 4054 |
A backreference to a numbered subpattern uses the most recent value |
A back reference to a numbered subpattern uses the most recent value |
| 4055 |
that is set for that number by any subpattern. The following pattern |
that is set for that number by any subpattern. The following pattern |
| 4056 |
matches "abcabc" or "defdef": |
matches "abcabc" or "defdef": |
| 4057 |
|
|
| 4085 |
|
|
| 4086 |
In PCRE, a subpattern can be named in one of three ways: (?<name>...) |
In PCRE, a subpattern can be named in one of three ways: (?<name>...) |
| 4087 |
or (?'name'...) as in Perl, or (?P<name>...) as in Python. References |
or (?'name'...) as in Perl, or (?P<name>...) as in Python. References |
| 4088 |
to capturing parentheses from other parts of the pattern, such as back- |
to capturing parentheses from other parts of the pattern, such as back |
| 4089 |
references, recursion, and conditions, can be made by name as well as |
references, recursion, and conditions, can be made by name as well as |
| 4090 |
by number. |
by number. |
| 4091 |
|
|
| 4121 |
that name that matched. This saves searching to find which numbered |
that name that matched. This saves searching to find which numbered |
| 4122 |
subpattern it was. |
subpattern it was. |
| 4123 |
|
|
| 4124 |
If you make a backreference to a non-unique named subpattern from else- |
If you make a back reference to a non-unique named subpattern from |
| 4125 |
where in the pattern, the one that corresponds to the first occurrence |
elsewhere in the pattern, the one that corresponds to the first occur- |
| 4126 |
of the name is used. In the absence of duplicate numbers (see the pre- |
rence of the name is used. In the absence of duplicate numbers (see the |
| 4127 |
vious section) this is the one with the lowest number. If you use a |
previous section) this is the one with the lowest number. If you use a |
| 4128 |
named reference in a condition test (see the section about conditions |
named reference in a condition test (see the section about conditions |
| 4129 |
below), either to check whether a subpattern has matched, or to check |
below), either to check whether a subpattern has matched, or to check |
| 4130 |
for recursion, all subpatterns with the same name are tested. If the |
for recursion, all subpatterns with the same name are tested. If the |
| 4270 |
mization, or alternatively using ^ to indicate anchoring explicitly. |
mization, or alternatively using ^ to indicate anchoring explicitly. |
| 4271 |
|
|
| 4272 |
However, there is one situation where the optimization cannot be used. |
However, there is one situation where the optimization cannot be used. |
| 4273 |
When .* is inside capturing parentheses that are the subject of a |
When .* is inside capturing parentheses that are the subject of a back |
| 4274 |
backreference elsewhere in the pattern, a match at the start may fail |
reference elsewhere in the pattern, a match at the start may fail where |
| 4275 |
where a later one succeeds. Consider, for example: |
a later one succeeds. Consider, for example: |
| 4276 |
|
|
| 4277 |
(.*)abc\1 |
(.*)abc\1 |
| 4278 |
|
|
| 4494 |
PCRE_EXTENDED option is set, this can be whitespace. Otherwise, the \g{ |
PCRE_EXTENDED option is set, this can be whitespace. Otherwise, the \g{ |
| 4495 |
syntax or an empty comment (see "Comments" below) can be used. |
syntax or an empty comment (see "Comments" below) can be used. |
| 4496 |
|
|
| 4497 |
|
Recursive back references |
| 4498 |
|
|
| 4499 |
A back reference that occurs inside the parentheses to which it refers |
A back reference that occurs inside the parentheses to which it refers |
| 4500 |
fails when the subpattern is first used, so, for example, (a\1) never |
fails when the subpattern is first used, so, for example, (a\1) never |
| 4501 |
matches. However, such references can be useful inside repeated sub- |
matches. However, such references can be useful inside repeated sub- |
| 4510 |
to match the back reference. This can be done using alternation, as in |
to match the back reference. This can be done using alternation, as in |
| 4511 |
the example above, or by a quantifier with a minimum of zero. |
the example above, or by a quantifier with a minimum of zero. |
| 4512 |
|
|
| 4513 |
|
Back references of this type cause the group that they reference to be |
| 4514 |
|
treated as an atomic group. Once the whole group has been matched, a |
| 4515 |
|
subsequent matching failure cannot cause backtracking into the middle |
| 4516 |
|
of the group. |
| 4517 |
|
|
| 4518 |
|
|
| 4519 |
ASSERTIONS |
ASSERTIONS |
| 4520 |
|
|
| 4521 |
An assertion is a test on the characters following or preceding the |
An assertion is a test on the characters following or preceding the |
| 4522 |
current matching point that does not actually consume any characters. |
current matching point that does not actually consume any characters. |
| 4523 |
The simple assertions coded as \b, \B, \A, \G, \Z, \z, ^ and $ are |
The simple assertions coded as \b, \B, \A, \G, \Z, \z, ^ and $ are |
| 4524 |
described above. |
described above. |
| 4525 |
|
|
| 4526 |
More complicated assertions are coded as subpatterns. There are two |
More complicated assertions are coded as subpatterns. There are two |
| 4527 |
kinds: those that look ahead of the current position in the subject |
kinds: those that look ahead of the current position in the subject |
| 4528 |
string, and those that look behind it. An assertion subpattern is |
string, and those that look behind it. An assertion subpattern is |
| 4529 |
matched in the normal way, except that it does not cause the current |
matched in the normal way, except that it does not cause the current |
| 4530 |
matching position to be changed. |
matching position to be changed. |
| 4531 |
|
|
| 4532 |
Assertion subpatterns are not capturing subpatterns, and may not be |
Assertion subpatterns are not capturing subpatterns, and may not be |
| 4533 |
repeated, because it makes no sense to assert the same thing several |
repeated, because it makes no sense to assert the same thing several |
| 4534 |
times. If any kind of assertion contains capturing subpatterns within |
times. If any kind of assertion contains capturing subpatterns within |
| 4535 |
it, these are counted for the purposes of numbering the capturing sub- |
it, these are counted for the purposes of numbering the capturing sub- |
| 4536 |
patterns in the whole pattern. However, substring capturing is carried |
patterns in the whole pattern. However, substring capturing is carried |
| 4537 |
out only for positive assertions, because it does not make sense for |
out only for positive assertions, because it does not make sense for |
| 4538 |
negative assertions. |
negative assertions. |
| 4539 |
|
|
| 4540 |
Lookahead assertions |
Lookahead assertions |
| 4544 |
|
|
| 4545 |
\w+(?=;) |
\w+(?=;) |
| 4546 |
|
|
| 4547 |
matches a word followed by a semicolon, but does not include the semi- |
matches a word followed by a semicolon, but does not include the semi- |
| 4548 |
colon in the match, and |
colon in the match, and |
| 4549 |
|
|
| 4550 |
foo(?!bar) |
foo(?!bar) |
| 4551 |
|
|
| 4552 |
matches any occurrence of "foo" that is not followed by "bar". Note |
matches any occurrence of "foo" that is not followed by "bar". Note |
| 4553 |
that the apparently similar pattern |
that the apparently similar pattern |
| 4554 |
|
|
| 4555 |
(?!foo)bar |
(?!foo)bar |
| 4556 |
|
|
| 4557 |
does not find an occurrence of "bar" that is preceded by something |
does not find an occurrence of "bar" that is preceded by something |
| 4558 |
other than "foo"; it finds any occurrence of "bar" whatsoever, because |
other than "foo"; it finds any occurrence of "bar" whatsoever, because |
| 4559 |
the assertion (?!foo) is always true when the next three characters are |
the assertion (?!foo) is always true when the next three characters are |
| 4560 |
"bar". A lookbehind assertion is needed to achieve the other effect. |
"bar". A lookbehind assertion is needed to achieve the other effect. |
| 4561 |
|
|
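The three lookahead patterns above can be tried out directly. The following is a minimal sketch that assumes Python's standard `re` module as a stand-in for PCRE (the two agree on these particular constructs); the helper name `spans` and the sample subjects are illustrative only.

```python
import re

# Assumption: Python's re module is used here in place of PCRE.
WORD_BEFORE_SEMICOLON = re.compile(r"\w+(?=;)")
FOO_NOT_BEFORE_BAR = re.compile(r"foo(?!bar)")
MISLEADING = re.compile(r"(?!foo)bar")


def spans(pattern, subject):
    """Return (start offset, matched text) for every match in subject."""
    return [(m.start(), m.group()) for m in pattern.finditer(subject)]


# The semicolon is tested but never becomes part of the match.
word_hits = spans(WORD_BEFORE_SEMICOLON, "alpha; beta gamma;")

# Only the "foo" that is not followed by "bar" is reported.
foo_hits = spans(FOO_NOT_BEFORE_BAR, "foobar foobaz")

# (?!foo) looks forward, so it cannot reject a "bar" preceded by "foo".
bar_hits = spans(MISLEADING, "foobar xbar")
```

In this sketch `bar_hits` contains both occurrences of "bar", including the one that follows "foo", which is the pitfall the text describes.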
| 4562 |
If you want to force a matching failure at some point in a pattern, the |
If you want to force a matching failure at some point in a pattern, the |
| 4563 |
most convenient way to do it is with (?!) because an empty string |
most convenient way to do it is with (?!) because an empty string |
| 4564 |
always matches, so an assertion that requires there not to be an empty |
always matches, so an assertion that requires there not to be an empty |
| 4565 |
string must always fail. The Perl 5.10 backtracking control verb |
string must always fail. The Perl 5.10 backtracking control verb |
| 4566 |
(*FAIL) or (*F) is essentially a synonym for (?!). |
(*FAIL) or (*F) is essentially a synonym for (?!). |
| 4567 |
|
|
| 4568 |
Lookbehind assertions |
Lookbehind assertions |
| 4569 |
|
|
| 4570 |
Lookbehind assertions start with (?<= for positive assertions and (?<! |
Lookbehind assertions start with (?<= for positive assertions and (?<! |
| 4571 |
for negative assertions. For example, |
for negative assertions. For example, |
| 4572 |
|
|
| 4573 |
(?<!foo)bar |
(?<!foo)bar |
| 4574 |
|
|
| 4575 |
does find an occurrence of "bar" that is not preceded by "foo". The |
does find an occurrence of "bar" that is not preceded by "foo". The |
| 4576 |
contents of a lookbehind assertion are restricted such that all the |
contents of a lookbehind assertion are restricted such that all the |
| 4577 |
strings it matches must have a fixed length. However, if there are sev- |
strings it matches must have a fixed length. However, if there are sev- |
| 4578 |
eral top-level alternatives, they do not all have to have the same |
eral top-level alternatives, they do not all have to have the same |
| 4579 |
fixed length. Thus |
fixed length. Thus |
| 4580 |
|
|
| 4581 |
(?<=bullock|donkey) |
(?<=bullock|donkey) |
| 4584 |
|
|
| 4585 |
(?<!dogs?|cats?) |
(?<!dogs?|cats?) |
| 4586 |
|
|
| 4587 |
causes an error at compile time. Branches that match different length |
causes an error at compile time. Branches that match different length |
| 4588 |
strings are permitted only at the top level of a lookbehind assertion. |
strings are permitted only at the top level of a lookbehind assertion. |
| 4589 |
This is an extension compared with Perl (5.8 and 5.10), which requires |
This is an extension compared with Perl (5.8 and 5.10), which requires |
| 4590 |
all branches to match the same length of string. An assertion such as |
all branches to match the same length of string. An assertion such as |
| 4591 |
|
|
| 4592 |
(?<=ab(c|de)) |
(?<=ab(c|de)) |
| 4593 |
|
|
| 4594 |
is not permitted, because its single top-level branch can match two |
is not permitted, because its single top-level branch can match two |
| 4595 |
different lengths, but it is acceptable to PCRE if rewritten to use two |
different lengths, but it is acceptable to PCRE if rewritten to use two |
| 4596 |
top-level branches: |
top-level branches: |
| 4597 |
|
|
| 4598 |
(?<=abc|abde) |
(?<=abc|abde) |
| 4599 |
|
|
| 4600 |
In some cases, the Perl 5.10 escape sequence \K (see above) can be used |
In some cases, the Perl 5.10 escape sequence \K (see above) can be used |
| 4601 |
instead of a lookbehind assertion to get round the fixed-length |
instead of a lookbehind assertion to get round the fixed-length |
| 4602 |
restriction. |
restriction. |
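As a hedged illustration of the fixed-length rule, the sketch below uses Python's `re` module, which is an assumption and not PCRE itself. Python is stricter than the behaviour described above: it rejects a lookbehind whose top-level alternatives differ in length, so the last pattern shows one portable way to express the same test. The helper `compiles` is hypothetical.

```python
import re

# Assumption: Python's re module stands in for PCRE in this sketch.
NOT_AFTER_FOO = re.compile(r"(?<!foo)bar")
bar_starts = [m.start() for m in NOT_AFTER_FOO.finditer("foobar xbar")]


def compiles(pattern):
    """Report whether the regex engine accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


# Alternatives of equal length are accepted by Python.
same_width_ok = compiles(r"(?<=abc|xyz)")

# Per the text PCRE accepts this; Python's re rejects it because the
# alternatives are 7 and 6 characters long.
mixed_width_ok = compiles(r"(?<=bullock|donkey)")

# A portable rewrite: one fixed-length lookbehind per alternative.
PORTABLE = re.compile(r"(?:(?<=bullock)|(?<=donkey))s")
```

Splitting the alternatives into separate lookbehinds keeps each assertion at a single fixed length, which is the form most engines accept.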
| 4603 |
|
|
| 4604 |
The implementation of lookbehind assertions is, for each alternative, |
The implementation of lookbehind assertions is, for each alternative, |
| 4605 |
to temporarily move the current position back by the fixed length and |
to temporarily move the current position back by the fixed length and |
| 4606 |
then try to match. If there are insufficient characters before the cur- |
then try to match. If there are insufficient characters before the cur- |
| 4607 |
rent position, the assertion fails. |
rent position, the assertion fails. |
| 4608 |
|
|
| 4609 |
PCRE does not allow the \C escape (which matches a single byte in UTF-8 |
PCRE does not allow the \C escape (which matches a single byte in UTF-8 |
| 4610 |
mode) to appear in lookbehind assertions, because it makes it impossi- |
mode) to appear in lookbehind assertions, because it makes it impossi- |
| 4611 |
ble to calculate the length of the lookbehind. The \X and \R escapes, |
ble to calculate the length of the lookbehind. The \X and \R escapes, |
| 4612 |
which can match different numbers of bytes, are also not permitted. |
which can match different numbers of bytes, are also not permitted. |
| 4613 |
|
|
| 4614 |
"Subroutine" calls (see below) such as (?2) or (?&X) are permitted in |
"Subroutine" calls (see below) such as (?2) or (?&X) are permitted in |
| 4615 |
lookbehinds, as long as the subpattern matches a fixed-length string. |
lookbehinds, as long as the subpattern matches a fixed-length string. |
| 4616 |
Recursion, however, is not supported. |
Recursion, however, is not supported. |
| 4617 |
|
|
| 4618 |
Possessive quantifiers can be used in conjunction with lookbehind |
Possessive quantifiers can be used in conjunction with lookbehind |
| 4619 |
assertions to specify efficient matching of fixed-length strings at the |
assertions to specify efficient matching of fixed-length strings at the |
| 4620 |
end of subject strings. Consider a simple pattern such as |
end of subject strings. Consider a simple pattern such as |
| 4621 |
|
|
| 4622 |
abcd$ |
abcd$ |
| 4623 |
|
|
| 4624 |
when applied to a long string that does not match. Because matching |
when applied to a long string that does not match. Because matching |
| 4625 |
proceeds from left to right, PCRE will look for each "a" in the subject |
proceeds from left to right, PCRE will look for each "a" in the subject |
| 4626 |
and then see if what follows matches the rest of the pattern. If the |
and then see if what follows matches the rest of the pattern. If the |
| 4627 |
pattern is specified as |
pattern is specified as |
| 4628 |
|
|
| 4629 |
^.*abcd$ |
^.*abcd$ |
| 4630 |
|
|
| 4631 |
the initial .* matches the entire string at first, but when this fails |
the initial .* matches the entire string at first, but when this fails |
| 4632 |
(because there is no following "a"), it backtracks to match all but the |
(because there is no following "a"), it backtracks to match all but the |
| 4633 |
last character, then all but the last two characters, and so on. Once |
last character, then all but the last two characters, and so on. Once |
| 4634 |
again the search for "a" covers the entire string, from right to left, |
again the search for "a" covers the entire string, from right to left, |
| 4635 |
so we are no better off. However, if the pattern is written as |
so we are no better off. However, if the pattern is written as |
| 4636 |
|
|
| 4637 |
^.*+(?<=abcd) |
^.*+(?<=abcd) |
| 4638 |
|
|
| 4639 |
there can be no backtracking for the .*+ item; it can match only the |
there can be no backtracking for the .*+ item; it can match only the |
| 4640 |
entire string. The subsequent lookbehind assertion does a single test |
entire string. The subsequent lookbehind assertion does a single test |
| 4641 |
on the last four characters. If it fails, the match fails immediately. |
on the last four characters. If it fails, the match fails immediately. |
| 4642 |
For long strings, this approach makes a significant difference to the |
For long strings, this approach makes a significant difference to the |
| 4643 |
processing time. |
processing time. |
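The idea can be sketched in Python's `re` module, used here as an assumed stand-in for PCRE. Python 3.11 and later accept the possessive form `.*+` directly; to stay runnable on older versions, this sketch substitutes a well-known emulation, a capturing lookahead followed by a backreference, which likewise refuses to give characters back. The function name `ends_with_abcd` is illustrative, and subjects are assumed to be single-line.

```python
import re

# Emulation of ^.*+(?<=abcd): the lookahead captures the rest of the line
# once, \1 consumes exactly that text, and no backtracking into it occurs.
ENDS_WITH_ABCD = re.compile(r"^(?=(.*))\1(?<=abcd)")


def ends_with_abcd(subject):
    """Return True if the single-line subject ends with 'abcd'."""
    return ENDS_WITH_ABCD.search(subject) is not None


long_miss = ends_with_abcd("a" * 1000)   # one lookbehind test, then failure
short_hit = ends_with_abcd("xxxxabcd")
```

The lookbehind runs once at the end of the subject, so a non-matching subject is rejected after a single four-character comparison.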
| 4644 |
|
|
| 4645 |
Using multiple assertions |
Using multiple assertions |
| 4648 |
|
|
| 4649 |
(?<=\d{3})(?<!999)foo |
(?<=\d{3})(?<!999)foo |
| 4650 |
|
|
| 4651 |
matches "foo" preceded by three digits that are not "999". Notice that |
matches "foo" preceded by three digits that are not "999". Notice that |
| 4652 |
each of the assertions is applied independently at the same point in |
each of the assertions is applied independently at the same point in |
| 4653 |
the subject string. First there is a check that the previous three |
the subject string. First there is a check that the previous three |
| 4654 |
characters are all digits, and then there is a check that the same |
characters are all digits, and then there is a check that the same |
| 4655 |
three characters are not "999". This pattern does not match "foo" pre- |
three characters are not "999". This pattern does not match "foo" pre- |
| 4656 |
ceded by six characters, the first of which are digits and the last |
ceded by six characters, the first of which are digits and the last |
| 4657 |
three of which are not "999". For example, it doesn't match "123abc- |
three of which are not "999". For example, it doesn't match "123abc- |
| 4658 |
foo". A pattern to do that is |
foo". A pattern to do that is |
| 4659 |
|
|
| 4660 |
(?<=\d{3}...)(?<!999)foo |
(?<=\d{3}...)(?<!999)foo |
| 4661 |
|
|
| 4662 |
This time the first assertion looks at the preceding six characters, |
This time the first assertion looks at the preceding six characters, |
| 4663 |
checking that the first three are digits, and then the second assertion |
checking that the first three are digits, and then the second assertion |
| 4664 |
checks that the preceding three characters are not "999". |
checks that the preceding three characters are not "999". |
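The difference between the two patterns is easy to check. This is a minimal sketch that assumes Python's `re` module in place of PCRE; the helper `found` and the sample subjects are illustrative.

```python
import re

# Both assertions inspect the same three characters before "foo".
SAME_THREE = re.compile(r"(?<=\d{3})(?<!999)foo")

# The first assertion reaches back six characters, the second only three.
SIX_BACK = re.compile(r"(?<=\d{3}...)(?<!999)foo")


def found(pattern, subject):
    """Return True if pattern matches anywhere in subject."""
    return pattern.search(subject) is not None


same_three_results = [found(SAME_THREE, s) for s in ("123foo", "999foo", "123abcfoo")]
six_back_results = [found(SIX_BACK, s) for s in ("123abcfoo", "123999foo", "123foo")]
```

The last subject fails for `SIX_BACK` because fewer than six characters precede "foo", so the first lookbehind cannot be satisfied.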
| 4665 |
|
|
| 4667 |
|
|
| 4668 |
(?<=(?<!foo)bar)baz |
(?<=(?<!foo)bar)baz |
| 4669 |
|
|
| 4670 |
matches an occurrence of "baz" that is preceded by "bar" which in turn |
matches an occurrence of "baz" that is preceded by "bar" which in turn |
| 4671 |
is not preceded by "foo", while |
is not preceded by "foo", while |
| 4672 |
|
|
| 4673 |
(?<=\d{3}(?!999)...)foo |
(?<=\d{3}(?!999)...)foo |
| 4674 |
|
|
| 4675 |
is another pattern that matches "foo" preceded by three digits and any |
is another pattern that matches "foo" preceded by three digits and any |
| 4676 |
three characters that are not "999". |
three characters that are not "999". |
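Nested assertions of this kind can be exercised as follows. The sketch assumes Python's `re` module as a stand-in for PCRE; the constant names and sample subjects are illustrative.

```python
import re

# "baz" preceded by a "bar" that is itself not preceded by "foo".
BAZ_AFTER_LONE_BAR = re.compile(r"(?<=(?<!foo)bar)baz")

# "foo" preceded by three digits and then three characters other than "999".
DIGITS_THEN_NOT_999 = re.compile(r"(?<=\d{3}(?!999)...)foo")

baz_results = [
    BAZ_AFTER_LONE_BAR.search(s) is not None
    for s in ("barbaz", "xbarbaz", "foobarbaz")
]
foo_results = [
    DIGITS_THEN_NOT_999.search(s) is not None
    for s in ("123abcfoo", "123999foo", "abcdeffoo")
]
```

The inner assertions consume nothing, so each outer lookbehind still has a fixed length (three and six characters respectively).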
| 4677 |
|
|
| 4678 |
|
|
| 4679 |
CONDITIONAL SUBPATTERNS |
CONDITIONAL SUBPATTERNS |
| 4680 |
|
|
| 4681 |
It is possible to cause the matching process to obey a subpattern con- |
It is possible to cause the matching process to obey a subpattern con- |
| 4682 |
ditionally or to choose between two alternative subpatterns, depending |
ditionally or to choose between two alternative subpatterns, depending |
| 4683 |
on the result of an assertion, or whether a specific capturing subpat- |
on the result of an assertion, or whether a specific capturing subpat- |
| 4684 |
tern has already been matched. The two possible forms of conditional |
tern has already been matched. The two possible forms of conditional |
| 4685 |
subpattern are: |
subpattern are: |
| 4686 |
|
|
| 4687 |
(?(condition)yes-pattern) |
(?(condition)yes-pattern) |
| 4688 |
(?(condition)yes-pattern|no-pattern) |
(?(condition)yes-pattern|no-pattern) |
| 4689 |
|
|
| 4690 |
If the condition is satisfied, the yes-pattern is used; otherwise the |
If the condition is satisfied, the yes-pattern is used; otherwise the |
| 4691 |
no-pattern (if present) is used. If there are more than two alterna- |
no-pattern (if present) is used. If there are more than two alterna- |
| 4692 |
tives in the subpattern, a compile-time error occurs. |
tives in the subpattern, a compile-time error occurs. |
| 4693 |
|
|
| 4694 |
There are four kinds of condition: references to subpatterns, refer- |
There are four kinds of condition: references to subpatterns, refer- |
| 4695 |
ences to recursion, a pseudo-condition called DEFINE, and assertions. |
ences to recursion, a pseudo-condition called DEFINE, and assertions. |
| 4696 |
|
|
| 4697 |
Checking for a used subpattern by number |
Checking for a used subpattern by number |
| 4698 |
|
|
| 4699 |
If the text between the parentheses consists of a sequence of digits, |
If the text between the parentheses consists of a sequence of digits, |
| 4700 |
the condition is true if a capturing subpattern of that number has pre- |
the condition is true if a capturing subpattern of that number has pre- |
| 4701 |
viously matched. If there is more than one capturing subpattern with |
viously matched. If there is more than one capturing subpattern with |
| 4702 |
the same number (see the earlier section about duplicate subpattern |
the same number (see the earlier section about duplicate subpattern |
| 4703 |
numbers), the condition is true if any of them have been set. An alter- |
numbers), the condition is true if any of them have been set. An alter- |
| 4704 |
native notation is to precede the digits with a plus or minus sign. In |
native notation is to precede the digits with a plus or minus sign. In |
| 4705 |
this case, the subpattern number is relative rather than absolute. The |
this case, the subpattern number is relative rather than absolute. The |
| 4706 |
most recently opened parentheses can be referenced by (?(-1), the next |
most recently opened parentheses can be referenced by (?(-1), the next |
| 4707 |
most recent by (?(-2), and so on. In looping constructs it can also |
most recent by (?(-2), and so on. In looping constructs it can also |
| 4708 |
make sense to refer to subsequent groups with constructs such as |
make sense to refer to subsequent groups with constructs such as |
| 4709 |
(?(+2). |
(?(+2). |
| 4710 |
|
|
| 4711 |
Consider the following pattern, which contains non-significant white |
Consider the following pattern, which contains non-significant white |
| 4712 |
space to make it more readable (assume the PCRE_EXTENDED option) and to |
space to make it more readable (assume the PCRE_EXTENDED option) and to |
| 4713 |
divide it into three parts for ease of discussion: |
divide it into three parts for ease of discussion: |
| 4714 |
|
|
| 4715 |
( \( )? [^()]+ (?(1) \) ) |
( \( )? [^()]+ (?(1) \) ) |
| 4716 |
|
|
| 4717 |
The first part matches an optional opening parenthesis, and if that |
The first part matches an optional opening parenthesis, and if that |
| 4718 |
character is present, sets it as the first captured substring. The sec- |
character is present, sets it as the first captured substring. The sec- |
| 4719 |
ond part matches one or more characters that are not parentheses. The |
ond part matches one or more characters that are not parentheses. The |
| 4720 |
third part is a conditional subpattern that tests whether the first set |
third part is a conditional subpattern that tests whether the first set |
| 4721 |
of parentheses matched or not. If they did, that is, if the subject started |
of parentheses matched or not. If they did, that is, if the subject started |
| 4722 |
with an opening parenthesis, the condition is true, and so the yes-pat- |
with an opening parenthesis, the condition is true, and so the yes-pat- |
| 4723 |
tern is executed and a closing parenthesis is required. Otherwise, |
tern is executed and a closing parenthesis is required. Otherwise, |
| 4724 |
since no-pattern is not present, the subpattern matches nothing. In |
since no-pattern is not present, the subpattern matches nothing. In |
| 4725 |
other words, this pattern matches a sequence of non-parentheses, |
other words, this pattern matches a sequence of non-parentheses, |
| 4726 |
optionally enclosed in parentheses. |
optionally enclosed in parentheses. |
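The numbered condition can be tried with the pattern exactly as written. This is a minimal sketch assuming Python's `re` module, whose `re.VERBOSE` flag is taken here as the counterpart of PCRE_EXTENDED; the function name `accepts` is illustrative. Python does not accept the relative form `(?(-1)`, so only the absolute reference is shown.

```python
import re

# Whitespace in the pattern is ignored because of re.VERBOSE.
OPTIONALLY_PARENTHESIZED = re.compile(r"( \( )? [^()]+ (?(1) \) )", re.VERBOSE)


def accepts(subject):
    """Return True if the whole subject is text optionally wrapped in ( )."""
    return OPTIONALLY_PARENTHESIZED.fullmatch(subject) is not None


results = {s: accepts(s) for s in ("(abc)", "abc", "(abc", "abc)")}
```

A closing parenthesis is demanded only when group 1 took part in the match, so the unbalanced subjects are rejected.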
| 4727 |
|
|
| 4728 |
If you were embedding this pattern in a larger one, you could use a |
If you were embedding this pattern in a larger one, you could use a |
| 4729 |
relative reference: |
relative reference: |
| 4730 |
|
|
| 4731 |
...other stuff... ( \( )? [^()]+ (?(-1) \) ) ... |
...other stuff... ( \( )? [^()]+ (?(-1) \) ) ... |
| 4732 |
|
|
| 4733 |
This makes the fragment independent of the parentheses in the larger |
This makes the fragment independent of the parentheses in the larger |
| 4734 |
pattern. |
pattern. |
| 4735 |
|
|
| 4736 |
Checking for a used subpattern by name |
Checking for a used subpattern by name |
| 4737 |
|
|
| 4738 |
Perl uses the syntax (?(<name>)...) or (?('name')...) to test for a |
Perl uses the syntax (?(<name>)...) or (?('name')...) to test for a |
| 4739 |
used subpattern by name. For compatibility with earlier versions of |
used subpattern by name. For compatibility with earlier versions of |
| 4740 |
PCRE, which had this facility before Perl, the syntax (?(name)...) is |
PCRE, which had this facility before Perl, the syntax (?(name)...) is |
| 4741 |
also recognized. However, there is a possible ambiguity with this syn- |
also recognized. However, there is a possible ambiguity with this syn- |
| 4742 |
tax, because subpattern names may consist entirely of digits. PCRE |
tax, because subpattern names may consist entirely of digits. PCRE |
| 4743 |
looks first for a named subpattern; if it cannot find one and the name |
looks first for a named subpattern; if it cannot find one and the name |
| 4744 |
consists entirely of digits, PCRE looks for a subpattern of that num- |
consists entirely of digits, PCRE looks for a subpattern of that num- |
| 4745 |
ber, which must be greater than zero. Using subpattern names that con- |
ber, which must be greater than zero. Using subpattern names that con- |
| 4746 |
sist entirely of digits is not recommended. |
sist entirely of digits is not recommended. |
| 4747 |
|
|
| 4748 |
Rewriting the above example to use a named subpattern gives this: |
Rewriting the above example to use a named subpattern gives this: |
| 4749 |
|
|
| 4750 |
(?<OPEN> \( )? [^()]+ (?(<OPEN>) \) ) |
(?<OPEN> \( )? [^()]+ (?(<OPEN>) \) ) |
| 4751 |
|
|
| 4752 |
If the name used in a condition of this kind is a duplicate, the test |
If the name used in a condition of this kind is a duplicate, the test |
| 4753 |
is applied to all subpatterns of the same name, and is true if any one |
is applied to all subpatterns of the same name, and is true if any one |
| 4754 |
of them has matched. |
of them has matched. |
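A named condition can be sketched in the same way. The example assumes Python's `re` module, which spells the group as `(?P<OPEN>...)` and accepts only the bare `(?(OPEN)...)` form of the test, not the Perl forms with angle brackets or quotes.

```python
import re

# Python spelling of the named example: (?P<OPEN>...) and (?(OPEN)...).
NAMED = re.compile(r"(?P<OPEN> \( )? [^()]+ (?(OPEN) \) )", re.VERBOSE)

wrapped = NAMED.fullmatch("(abc)")
bare = NAMED.fullmatch("abc")
unbalanced = NAMED.fullmatch("(abc")

open_when_wrapped = wrapped.group("OPEN")   # the captured "("
open_when_bare = bare.group("OPEN")         # None: the group did not match
```

Referring to the group by name keeps the condition correct even if other capturing groups are later added in front of it.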
| 4755 |
|
|
| 4756 |
Checking for pattern recursion |
Checking for pattern recursion |
| 4757 |
|
|
| 4758 |
If the condition is the string (R), and there is no subpattern with the |
If the condition is the string (R), and there is no subpattern with the |
| 4759 |
name R, the condition is true if a recursive call to the whole pattern |
name R, the condition is true if a recursive call to the whole pattern |
| 4760 |
or any subpattern has been made. If digits or a name preceded by amper- |
or any subpattern has been made. If digits or a name preceded by amper- |
| 4761 |
sand follow the letter R, for example: |
sand follow the letter R, for example: |
| 4762 |
|
|
| 4764 |
|
|
| 4765 |
the condition is true if the most recent recursion is into a subpattern |
the condition is true if the most recent recursion is into a subpattern |
| 4766 |
whose number or name is given. This condition does not check the entire |
whose number or name is given. This condition does not check the entire |
| 4767 |
recursion stack. If the name used in a condition of this kind is a |
recursion stack. If the name used in a condition of this kind is a |
| 4768 |
duplicate, the test is applied to all subpatterns of the same name, and |
duplicate, the test is applied to all subpatterns of the same name, and |
| 4769 |
is true if any one of them is the most recent recursion. |
is true if any one of them is the most recent recursion. |
| 4770 |
|
|
| 4771 |
At "top level", all these recursion test conditions are false. The |
At "top level", all these recursion test conditions are false. The |
| 4772 |
syntax for recursive patterns is described below. |
syntax for recursive patterns is described below. |
| 4773 |
|
|
| 4774 |
Defining subpatterns for use by reference only |
Defining subpatterns for use by reference only |
| 4775 |
|
|
| 4776 |
If the condition is the string (DEFINE), and there is no subpattern |
If the condition is the string (DEFINE), and there is no subpattern |
| 4777 |
with the name DEFINE, the condition is always false. In this case, |
with the name DEFINE, the condition is always false. In this case, |
| 4778 |
there may be only one alternative in the subpattern. It is always |
there may be only one alternative in the subpattern. It is always |
| 4779 |
skipped if control reaches this point in the pattern; the idea of |
skipped if control reaches this point in the pattern; the idea of |
| 4780 |
DEFINE is that it can be used to define "subroutines" that can be ref- |
DEFINE is that it can be used to define "subroutines" that can be ref- |
| 4781 |
erenced from elsewhere. (The use of "subroutines" is described below.) |
erenced from elsewhere. (The use of "subroutines" is described below.) |
| 4782 |
For example, a pattern to match an IPv4 address could be written like |
For example, a pattern to match an IPv4 address could be written like |
| 4783 |
this (ignore whitespace and line breaks): |
this (ignore whitespace and line breaks): |
| 4784 |
|
|
| 4785 |
(?(DEFINE) (?<byte> 2[0-4]\d | 25[0-5] | 1\d\d | [1-9]?\d) ) |
(?(DEFINE) (?<byte> 2[0-4]\d | 25[0-5] | 1\d\d | [1-9]?\d) ) |
| 4786 |
\b (?&byte) (\.(?&byte)){3} \b |
\b (?&byte) (\.(?&byte)){3} \b |
| 4787 |
|
|
| 4788 |
The first part of the pattern is a DEFINE group inside which another |
The first part of the pattern is a DEFINE group inside which another |
| 4789 |
group named "byte" is defined. This matches an individual component of |
group named "byte" is defined. This matches an individual component of |
| 4790 |
an IPv4 address (a number less than 256). When matching takes place, |
an IPv4 address (a number less than 256). When matching takes place, |
| 4791 |
this part of the pattern is skipped because DEFINE acts like a false |
this part of the pattern is skipped because DEFINE acts like a false |
| 4792 |
condition. The rest of the pattern uses references to the named group |
condition. The rest of the pattern uses references to the named group |
| 4793 |
to match the four dot-separated components of an IPv4 address, insist- |
to match the four dot-separated components of an IPv4 address, insist- |
| 4794 |
ing on a word boundary at each end. |
ing on a word boundary at each end. |
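Python's `re` module has no DEFINE condition or `(?&name)` subroutine call, so the sketch below swaps in a different technique: the "byte" subpattern is defined once as a string and spliced into the main pattern. It illustrates the same define-once, reference-many idea under that assumption and is not the PCRE mechanism itself. The names `BYTE`, `IPV4` and `find_ipv4` are illustrative.

```python
import re

# Defined once, like the (?<byte> ...) group inside DEFINE.
BYTE = r"(?:2[0-4]\d|25[0-5]|1\d\d|[1-9]?\d)"

# Spliced in four times, where the PCRE pattern uses (?&byte).
IPV4 = re.compile(rf"\b{BYTE}(?:\.{BYTE}){{3}}\b")


def find_ipv4(text):
    """Return the first IPv4-looking substring in text, or None."""
    match = IPV4.search(text)
    return match.group() if match else None


address = find_ipv4("host 192.168.0.1 up")
too_short = find_ipv4("1.2.3")
```

Unlike a true subroutine call, string splicing repeats the subpattern text in the compiled pattern, which is acceptable for a small fragment such as this one.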
| 4795 |
|
|
| 4796 |
Assertion conditions |
Assertion conditions |
| 4797 |
|
|
| 4798 |
If the condition is not in any of the above formats, it must be an |
If the condition is not in any of the above formats, it must be an |
| 4799 |
assertion. This may be a positive or negative lookahead or lookbehind |
assertion. This may be a positive or negative lookahead or lookbehind |
| 4800 |
assertion. Consider this pattern, again containing non-significant |
assertion. Consider this pattern, again containing non-significant |
| 4801 |
white space, and with the two alternatives on the second line: |
white space, and with the two alternatives on the second line: |
| 4802 |
|
|
| 4803 |
(?(?=[^a-z]*[a-z]) |
(?(?=[^a-z]*[a-z]) |
| 4804 |
\d{2}-[a-z]{3}-\d{2} | \d{2}-\d{2}-\d{2} ) |
\d{2}-[a-z]{3}-\d{2} | \d{2}-\d{2}-\d{2} ) |
| 4805 |
|
|
| 4806 |
The condition is a positive lookahead assertion that matches an |
The condition is a positive lookahead assertion that matches an |
| 4807 |
optional sequence of non-letters followed by a letter. In other words, |
optional sequence of non-letters followed by a letter. In other words, |
| 4808 |
it tests for the presence of at least one letter in the subject. If a |
it tests for the presence of at least one letter in the subject. If a |
| 4809 |
letter is found, the subject is matched against the first alternative; |
letter is found, the subject is matched against the first alternative; |
| 4810 |
otherwise it is matched against the second. This pattern matches |
otherwise it is matched against the second. This pattern matches |
| 4811 |
strings in one of the two forms dd-aaa-dd or dd-dd-dd, where aaa are |
strings in one of the two forms dd-aaa-dd or dd-dd-dd, where aaa are |
| 4812 |
letters and dd are digits. |
letters and dd are digits. |
| 4813 |
|
|
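The effect of the assertion condition can be sketched outside PCRE. Python's standard re module does not accept an assertion as a condition, so this sketch rewrites the conditional as two alternatives, one guarded by the positive lookahead and the other by its negation. That rewriting is an assumption made for illustration, not PCRE syntax, and the names DATE_LIKE and matches are hypothetical.

```python
import re

# Emulates (?(?=[^a-z]*[a-z]) \d{2}-[a-z]{3}-\d{2} | \d{2}-\d{2}-\d{2} )
# with stdlib re: each alternative is guarded by the lookahead or its negation.
DATE_LIKE = re.compile(
    r"""
    (?: (?=[^a-z]*[a-z]) \d{2}-[a-z]{3}-\d{2}   # a letter lies ahead: dd-aaa-dd
      | (?![^a-z]*[a-z]) \d{2}-\d{2}-\d{2}      # no letter ahead: dd-dd-dd
    )
    """,
    re.VERBOSE,
)


def matches(subject: str) -> bool:
    """True if the whole subject has one of the two forms."""
    return DATE_LIKE.fullmatch(subject) is not None
```

With this sketch, "12-abc-34" and "12-34-56" are accepted, while "12-34-5a" is rejected: the lookahead sees a letter, so only the first form is tried.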
| 4814 |
|
|
| 4815 |
COMMENTS |
COMMENTS |
| 4816 |
|
|
| 4817 |
The sequence (?# marks the start of a comment that continues up to the |
The sequence (?# marks the start of a comment that continues up to the |
| 4818 |
next closing parenthesis. Nested parentheses are not permitted. The |
next closing parenthesis. Nested parentheses are not permitted. The |
| 4819 |
characters that make up a comment play no part in the pattern matching |
characters that make up a comment play no part in the pattern matching |
| 4820 |
at all. |
at all. |
| 4821 |
|
|
| 4822 |
If the PCRE_EXTENDED option is set, an unescaped # character outside a |
If the PCRE_EXTENDED option is set, an unescaped # character outside a |
| 4823 |
character class introduces a comment that continues to immediately |
character class introduces a comment that continues to immediately |
| 4824 |
after the next newline in the pattern. |
after the next newline in the pattern. |
| 4825 |
|
|
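Both comment forms can be tried in Python's standard re module, which accepts (?#...) comments and, under its re.VERBOSE flag, # comments that run to the end of the line. This is a sketch in Python rather than PCRE; re.VERBOSE is assumed here to play the role of PCRE_EXTENDED, and the names inline and extended are hypothetical.

```python
import re

# (?#...) comment: everything up to the next ")" is ignored.
inline = re.compile(r"ab(?#this text is ignored)c")

# Extended mode: an unescaped # outside a character class starts a comment.
extended = re.compile(
    r"""
    \d+   # one or more digits
    [#]   # inside a character class, '#' is a literal character
    """,
    re.VERBOSE,
)
```

Here inline matches the string "abc", and extended matches a run of digits followed by a literal "#", such as "42#".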
| 4826 |
|
|
| 4827 |
RECURSIVE PATTERNS |
RECURSIVE PATTERNS |
| 4828 |
|
|
| 4829 |
Consider the problem of matching a string in parentheses, allowing for |
Consider the problem of matching a string in parentheses, allowing for |
| 4830 |
unlimited nested parentheses. Without the use of recursion, the best |
unlimited nested parentheses. Without the use of recursion, the best |
| 4831 |
that can be done is to use a pattern that matches up to some fixed |
that can be done is to use a pattern that matches up to some fixed |
| 4832 |
depth of nesting. It is not possible to handle an arbitrary nesting |
depth of nesting. It is not possible to handle an arbitrary nesting |
| 4833 |
depth. |
depth. |
| 4834 |
|
|
| 4835 |
For some time, Perl has provided a facility that allows regular expres- |
For some time, Perl has provided a facility that allows regular expres- |
| 4836 |
sions to recurse (amongst other things). It does this by interpolating |
sions to recurse (amongst other things). It does this by interpolating |
| 4837 |
Perl code in the expression at run time, and the code can refer to the |
Perl code in the expression at run time, and the code can refer to the |
| 4838 |
expression itself. A Perl pattern using code interpolation to solve the |
expression itself. A Perl pattern using code interpolation to solve the |
| 4839 |
parentheses problem can be created like this: |
parentheses problem can be created like this: |
| 4840 |
|
|
| 4844 |
refers recursively to the pattern in which it appears. |
refers recursively to the pattern in which it appears. |
| 4845 |
|
|
| 4846 |
Obviously, PCRE cannot support the interpolation of Perl code. Instead, |
Obviously, PCRE cannot support the interpolation of Perl code. Instead, |
| 4847 |
it supports special syntax for recursion of the entire pattern, and |
it supports special syntax for recursion of the entire pattern, and |
| 4848 |
also for individual subpattern recursion. After its introduction in |
also for individual subpattern recursion. After its introduction in |
| 4849 |
PCRE and Python, this kind of recursion was subsequently introduced |
PCRE and Python, this kind of recursion was subsequently introduced |
| 4850 |
into Perl at release 5.10. |
into Perl at release 5.10. |
| 4851 |
|
|
| 4852 |
A special item that consists of (? followed by a number greater than |
A special item that consists of (? followed by a number greater than |
| 4853 |
zero and a closing parenthesis is a recursive call of the subpattern of |
zero and a closing parenthesis is a recursive call of the subpattern of |
| 4854 |
the given number, provided that it occurs inside that subpattern. (If |
the given number, provided that it occurs inside that subpattern. (If |
| 4855 |
not, it is a "subroutine" call, which is described in the next sec- |
not, it is a "subroutine" call, which is described in the next sec- |
| 4856 |
tion.) The special item (?R) or (?0) is a recursive call of the entire |
tion.) The special item (?R) or (?0) is a recursive call of the entire |
| 4857 |
regular expression. |
regular expression. |
| 4858 |
|
|
| 4859 |
This PCRE pattern solves the nested parentheses problem (assume the |
This PCRE pattern solves the nested parentheses problem (assume the |
| 4860 |
PCRE_EXTENDED option is set so that white space is ignored): |
PCRE_EXTENDED option is set so that white space is ignored): |
| 4861 |
|
|
| 4862 |
\( ( [^()]++ | (?R) )* \) |
\( ( [^()]++ | (?R) )* \) |
| 4863 |
|
|
| 4864 |
First it matches an opening parenthesis. Then it matches any number of |
First it matches an opening parenthesis. Then it matches any number of |
| 4865 |
substrings which can either be a sequence of non-parentheses, or a |
substrings which can either be a sequence of non-parentheses, or a |
| 4866 |
recursive match of the pattern itself (that is, a correctly parenthe- |
recursive match of the pattern itself (that is, a correctly parenthe- |
| 4867 |
sized substring). Finally there is a closing parenthesis. Note the use |
sized substring). Finally there is a closing parenthesis. Note the use |
| 4868 |
of a possessive quantifier to avoid backtracking into sequences of non- |
of a possessive quantifier to avoid backtracking into sequences of non- |
| 4869 |
parentheses. |
parentheses. |
| 4870 |
|
|
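To make the recursion concrete, here is a hand-written Python function that accepts the same strings as the pattern above. It is not PCRE and does not use a regex engine; it is a sketch of the grammar the pattern describes, and the name match_parens is hypothetical. The recursive call stands in for (?R), and the plain character skip stands in for a run matched by [^()]++.

```python
def match_parens(s: str, i: int = 0) -> int:
    """Return the index just past a balanced group starting at s[i], or -1."""
    if i >= len(s) or s[i] != "(":
        return -1
    i += 1
    while i < len(s) and s[i] != ")":
        if s[i] == "(":
            i = match_parens(s, i)   # plays the role of (?R)
            if i == -1:
                return -1
        else:
            i += 1                   # one character of a [^()]++ run
    # Success only if the loop stopped on a closing parenthesis.
    return i + 1 if i < len(s) else -1
```

Like the possessive quantifier in the pattern, this function never revisits characters it has already consumed, so it fails quickly on unbalanced input.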
| 4871 |
If this were part of a larger pattern, you would not want to recurse |
If this were part of a larger pattern, you would not want to recurse |
| 4872 |
the entire pattern, so instead you could use this: |
the entire pattern, so instead you could use this: |
| 4873 |
|
|
| 4874 |
( \( ( [^()]++ | (?1) )* \) ) |
( \( ( [^()]++ | (?1) )* \) ) |
| 4875 |
|
|
| 4876 |
We have put the pattern into parentheses, and caused the recursion to |
We have put the pattern into parentheses, and caused the recursion to |
| 4877 |
refer to them instead of the whole pattern. |
refer to them instead of the whole pattern. |
| 4878 |
|
|
| 4879 |
In a larger pattern, keeping track of parenthesis numbers can be |
In a larger pattern, keeping track of parenthesis numbers can be |
| 4880 |
tricky. This is made easier by the use of relative references (a Perl |
tricky. This is made easier by the use of relative references (a Perl |
| 4881 |
5.10 feature). Instead of (?1) in the pattern above you can write |
5.10 feature). Instead of (?1) in the pattern above you can write |
| 4882 |
(?-2) to refer to the second most recently opened parentheses preceding |
(?-2) to refer to the second most recently opened parentheses preceding |
| 4883 |
the recursion. In other words, a negative number counts capturing |
the recursion. In other words, a negative number counts capturing |
| 4884 |
parentheses leftwards from the point at which it is encountered. |
parentheses leftwards from the point at which it is encountered. |
| 4885 |
|
|
| 4886 |
It is also possible to refer to subsequently opened parentheses, by |
It is also possible to refer to subsequently opened parentheses, by |
| 4887 |
writing references such as (?+2). However, these cannot be recursive |
writing references such as (?+2). However, these cannot be recursive |
| 4888 |
because the reference is not inside the parentheses that are refer- |
because the reference is not inside the parentheses that are refer- |
| 4889 |
enced. They are always "subroutine" calls, as described in the next |
enced. They are always "subroutine" calls, as described in the next |
| 4890 |
section. |
section. |
| 4891 |
|
|
| 4892 |
An alternative approach is to use named parentheses instead. The Perl |
An alternative approach is to use named parentheses instead. The Perl |
| 4893 |
syntax for this is (?&name); PCRE's earlier syntax (?P>name) is also |
syntax for this is (?&name); PCRE's earlier syntax (?P>name) is also |
| 4894 |
supported. We could rewrite the above example as follows: |
supported. We could rewrite the above example as follows: |
| 4895 |
|
|
| 4896 |
(?<pn> \( ( [^()]++ | (?&pn) )* \) ) |
(?<pn> \( ( [^()]++ | (?&pn) )* \) ) |
| 4897 |
|
|
| 4898 |
If there is more than one subpattern with the same name, the earliest |
If there is more than one subpattern with the same name, the earliest |
| 4899 |
one is used. |
one is used. |
| 4900 |
|
|
| 4901 |
This particular example pattern that we have been looking at contains |
This particular example pattern that we have been looking at contains |
| 4902 |
nested unlimited repeats, and so the use of a possessive quantifier for |
nested unlimited repeats, and so the use of a possessive quantifier for |
| 4903 |
matching strings of non-parentheses is important when applying the pat- |
matching strings of non-parentheses is important when applying the pat- |
| 4904 |
tern to strings that do not match. For example, when this pattern is |
tern to strings that do not match. For example, when this pattern is |
| 4905 |
applied to |
applied to |
| 4906 |
|
|
| 4907 |
(aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa() |
(aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa() |
| 4908 |
|
|
| 4909 |
it yields "no match" quickly. However, if a possessive quantifier is |
it yields "no match" quickly. However, if a possessive quantifier is |
| 4910 |
not used, the match runs for a very long time indeed because there are |
not used, the match runs for a very long time indeed because there are |
| 4911 |
so many different ways the + and * repeats can carve up the subject, |
so many different ways the + and * repeats can carve up the subject, |
| 4912 |
and all have to be tested before failure can be reported. |
and all have to be tested before failure can be reported. |
| 4913 |
|
|
| 4914 |
At the end of a match, the values of capturing parentheses are those |
At the end of a match, the values of capturing parentheses are those |
| 4915 |
from the outermost level. If you want to obtain intermediate values, a |
from the outermost level. If you want to obtain intermediate values, a |
| 4916 |
callout function can be used (see below and the pcrecallout documenta- |
callout function can be used (see below and the pcrecallout documenta- |
| 4917 |
tion). If the pattern above is matched against |
tion). If the pattern above is matched against |
| 4918 |
|
|
| 4919 |
(ab(cd)ef) |
(ab(cd)ef) |
| 4920 |
|
|
| 4921 |
the value for the inner capturing parentheses (numbered 2) is "ef", |
the value for the inner capturing parentheses (numbered 2) is "ef", |
| 4922 |
which is the last value taken on at the top level. If a capturing sub- |
which is the last value taken on at the top level. If a capturing sub- |
| 4923 |
pattern is not matched at the top level, its final value is unset, even |
pattern is not matched at the top level, its final value is unset, even |
| 4924 |
if it is (temporarily) set at a deeper level. |
if it is (temporarily) set at a deeper level. |
| 4925 |
|
|
| 4926 |
If there are more than 15 capturing parentheses in a pattern, PCRE has |
If there are more than 15 capturing parentheses in a pattern, PCRE has |
| 4927 |
to obtain extra memory to store data during a recursion, which it does |
to obtain extra memory to store data during a recursion, which it does |
| 4928 |
by using pcre_malloc, freeing it via pcre_free afterwards. If no memory |
by using pcre_malloc, freeing it via pcre_free afterwards. If no memory |
| 4929 |
can be obtained, the match fails with the PCRE_ERROR_NOMEMORY error. |
can be obtained, the match fails with the PCRE_ERROR_NOMEMORY error. |
| 4930 |
|
|
| 4931 |
Do not confuse the (?R) item with the condition (R), which tests for |
Do not confuse the (?R) item with the condition (R), which tests for |
| 4932 |
recursion. Consider this pattern, which matches text in angle brack- |
recursion. Consider this pattern, which matches text in angle brack- |
| 4933 |
ets, allowing for arbitrary nesting. Only digits are allowed in nested |
ets, allowing for arbitrary nesting. Only digits are allowed in nested |
| 4934 |
brackets (that is, when recursing), whereas any characters are permit- |
brackets (that is, when recursing), whereas any characters are permit- |
| 4935 |
ted at the outer level. |
ted at the outer level. |
| 4936 |
|
|
| 4937 |
< (?: (?(R) \d++ | [^<>]*+) | (?R)) * > |
< (?: (?(R) \d++ | [^<>]*+) | (?R)) * > |
| 4938 |
|
|
| 4939 |
In this pattern, (?(R) is the start of a conditional subpattern, with |
In this pattern, (?(R) is the start of a conditional subpattern, with |
| 4940 |
two different alternatives for the recursive and non-recursive cases. |
two different alternatives for the recursive and non-recursive cases. |
| 4941 |
The (?R) item is the actual recursive call. |
The (?R) item is the actual recursive call. |
| 4942 |
|
|
| 4943 |
Recursion difference from Perl |
Recursion difference from Perl |
| 4944 |
|
|
| 4945 |
In PCRE (like Python, but unlike Perl), a recursive subpattern call is |
In PCRE (like Python, but unlike Perl), a recursive subpattern call is |
| 4946 |
always treated as an atomic group. That is, once it has matched some of |
always treated as an atomic group. That is, once it has matched some of |
| 4947 |
the subject string, it is never re-entered, even if it contains untried |
the subject string, it is never re-entered, even if it contains untried |
| 4948 |
alternatives and there is a subsequent matching failure. This can be |
alternatives and there is a subsequent matching failure. This can be |
| 4949 |
illustrated by the following pattern, which purports to match a palin- |
illustrated by the following pattern, which purports to match a palin- |
| 4950 |
dromic string that contains an odd number of characters (for example, |
dromic string that contains an odd number of characters (for example, |
| 4951 |
"a", "aba", "abcba", "abcdcba"): |
"a", "aba", "abcba", "abcdcba"): |
| 4952 |
|
|
| 4953 |
^(.|(.)(?1)\2)$ |
^(.|(.)(?1)\2)$ |
| 4954 |
|
|
| 4955 |
The idea is that it either matches a single character, or two identical |
The idea is that it either matches a single character, or two identical |
| 4956 |
characters surrounding a sub-palindrome. In Perl, this pattern works; |
characters surrounding a sub-palindrome. In Perl, this pattern works; |
| 4957 |
in PCRE it does not if the subject string is longer than three characters. |
in PCRE it does not if the subject string is longer than three characters. |
| 4958 |
Consider the subject string "abcba": |
Consider the subject string "abcba": |
| 4959 |
|
|
| 4960 |
At the top level, the first character is matched, but as it is not at |
At the top level, the first character is matched, but as it is not at |
| 4961 |
the end of the string, the first alternative fails; the second alterna- |
the end of the string, the first alternative fails; the second alterna- |
| 4962 |
tive is taken and the recursion kicks in. The recursive call to subpat- |
tive is taken and the recursion kicks in. The recursive call to subpat- |
| 4963 |
tern 1 successfully matches the next character ("b"). (Note that the |
tern 1 successfully matches the next character ("b"). (Note that the |
| 4964 |
beginning and end of line tests are not part of the recursion). |
beginning and end of line tests are not part of the recursion). |
| 4965 |
|
|
| 4966 |
Back at the top level, the next character ("c") is compared with what |
Back at the top level, the next character ("c") is compared with what |
| 4967 |
subpattern 2 matched, which was "a". This fails. Because the recursion |
subpattern 2 matched, which was "a". This fails. Because the recursion |
| 4968 |
is treated as an atomic group, there are now no backtracking points, |
is treated as an atomic group, there are now no backtracking points, |
| 4969 |
and so the entire match fails. (Perl is able, at this point, to re- |
and so the entire match fails. (Perl is able, at this point, to re- |
| 4970 |
enter the recursion and try the second alternative.) However, if the |
enter the recursion and try the second alternative.) However, if the |
| 4971 |
pattern is written with the alternatives in the other order, things are |
pattern is written with the alternatives in the other order, things are |
| 4972 |
different: |
different: |
| 4973 |
|
|
| 4974 |
^((.)(?1)\2|.)$ |
^((.)(?1)\2|.)$ |
| 4975 |
|
|
| 4976 |
This time, the recursing alternative is tried first, and continues to |
This time, the recursing alternative is tried first, and continues to |
| 4977 |
recurse until it runs out of characters, at which point the recursion |
recurse until it runs out of characters, at which point the recursion |
| 4978 |
fails. But this time we do have another alternative to try at the |
fails. But this time we do have another alternative to try at the |
| 4979 |
higher level. That is the big difference: in the previous case the |
higher level. That is the big difference: in the previous case the |
| 4980 |
remaining alternative is at a deeper recursion level, which PCRE cannot |
remaining alternative is at a deeper recursion level, which PCRE cannot |
| 4981 |
use. |
use. |
| 4982 |
|
|
| 4983 |
To change the pattern so that it matches all palindromic strings, not just |
To change the pattern so that it matches all palindromic strings, not just |
| 4984 |
those with an odd number of characters, it is tempting to change the |
those with an odd number of characters, it is tempting to change the |
| 4985 |
pattern to this: |
pattern to this: |
| 4986 |
|
|
| 4987 |
^((.)(?1)\2|.?)$ |
^((.)(?1)\2|.?)$ |
| 4988 |
|
|
| 4989 |
Again, this works in Perl, but not in PCRE, and for the same reason. |
Again, this works in Perl, but not in PCRE, and for the same reason. |
| 4990 |
When a deeper recursion has matched a single character, it cannot be |
When a deeper recursion has matched a single character, it cannot be |
| 4991 |
entered again in order to match an empty string. The solution is to |
entered again in order to match an empty string. The solution is to |
| 4992 |
separate the two cases, and write out the odd and even cases as alter- |
separate the two cases, and write out the odd and even cases as alter- |
| 4993 |
natives at the higher level: |
natives at the higher level: |
| 4994 |
|
|
| 4995 |
^(?:((.)(?1)\2|)|((.)(?3)\4|.)) |
^(?:((.)(?1)\2|)|((.)(?3)\4|.)) |
| 4996 |
|
|
| 4997 |
If you want to match typical palindromic phrases, the pattern has to |
If you want to match typical palindromic phrases, the pattern has to |
| 4998 |
ignore all non-word characters, which can be done like this: |
ignore all non-word characters, which can be done like this: |
| 4999 |
|
|
| 5000 |
^\W*+(?:((.)\W*+(?1)\W*+\2|)|((.)\W*+(?3)\W*+\4|\W*+.\W*+))\W*+$ |
^\W*+(?:((.)\W*+(?1)\W*+\2|)|((.)\W*+(?3)\W*+\4|\W*+.\W*+))\W*+$ |
| 5001 |
|
|
| 5002 |
If run with the PCRE_CASELESS option, this pattern matches phrases such |
If run with the PCRE_CASELESS option, this pattern matches phrases such |
| 5003 |
as "A man, a plan, a canal: Panama!" and it works well in both PCRE and |
as "A man, a plan, a canal: Panama!" and it works well in both PCRE and |
| 5004 |
Perl. Note the use of the possessive quantifier *+ to avoid backtrack- |
Perl. Note the use of the possessive quantifier *+ to avoid backtrack- |
| 5005 |
ing into sequences of non-word characters. Without this, PCRE takes a |
ing into sequences of non-word characters. Without this, PCRE takes a |
| 5006 |
great deal longer (ten times or more) to match typical phrases, and |
great deal longer (ten times or more) to match typical phrases, and |
| 5007 |
Perl takes so long that you think it has gone into a loop. |
Perl takes so long that you think it has gone into a loop. |
| 5008 |
|
|
| 5009 |
WARNING: The palindrome-matching patterns above work only if the sub- |
WARNING: The palindrome-matching patterns above work only if the sub- |
| 5010 |
ject string does not start with a palindrome that is shorter than the |
ject string does not start with a palindrome that is shorter than the |
| 5011 |
entire string. For example, although "abcba" is correctly matched, if |
entire string. For example, although "abcba" is correctly matched, if |
| 5012 |
the subject is "ababa", PCRE finds the palindrome "aba" at the start, |
the subject is "ababa", PCRE finds the palindrome "aba" at the start, |
| 5013 |
then fails at top level because the end of the string does not follow. |
then fails at top level because the end of the string does not follow. |
| 5014 |
Once again, it cannot jump back into the recursion to try other alter- |
Once again, it cannot jump back into the recursion to try other alter- |
| 5015 |
natives, so the entire match fails. |
natives, so the entire match fails. |
| 5016 |
|
|
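When experimenting with the palindrome patterns, it can help to have an independent check of what they are meant to accept. The following Python sketch is such a check, written without recursion: it drops non-word characters, folds case, and compares the result with its reverse. The function name is_palindromic_phrase is hypothetical, and Python's \W is assumed to be close enough to PCRE's for ASCII phrases.

```python
import re


def is_palindromic_phrase(phrase: str) -> bool:
    """Reference check: ignore non-word characters and letter case."""
    letters = re.sub(r"\W+", "", phrase).casefold()
    return letters == letters[::-1]
```

This check accepts "A man, a plan, a canal: Panama!" and also "ababa", the subject that the warning above says the PCRE patterns fail to match, so it can be used to spot such cases.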
| 5017 |
|
|
| 5018 |
SUBPATTERNS AS SUBROUTINES |
SUBPATTERNS AS SUBROUTINES |
| 5019 |
|
|
| 5020 |
If the syntax for a recursive subpattern reference (either by number or |
If the syntax for a recursive subpattern reference (either by number or |
| 5021 |
by name) is used outside the parentheses to which it refers, it oper- |
by name) is used outside the parentheses to which it refers, it oper- |
| 5022 |
ates like a subroutine in a programming language. The "called" subpat- |
ates like a subroutine in a programming language. The "called" subpat- |
| 5023 |
tern may be defined before or after the reference. A numbered reference |
tern may be defined before or after the reference. A numbered reference |
| 5024 |
can be absolute or relative, as in these examples: |
can be absolute or relative, as in these examples: |
| 5025 |
|
|
| 5031 |
|
|
| 5032 |
(sens|respons)e and \1ibility |
(sens|respons)e and \1ibility |
| 5033 |
|
|
| 5034 |
matches "sense and sensibility" and "response and responsibility", but |
matches "sense and sensibility" and "response and responsibility", but |
| 5035 |
not "sense and responsibility". If instead the pattern |
not "sense and responsibility". If instead the pattern |
| 5036 |
|
|
| 5037 |
(sens|respons)e and (?1)ibility |
(sens|respons)e and (?1)ibility |
| 5038 |
|
|
| 5039 |
is used, it does match "sense and responsibility" as well as the other |
is used, it does match "sense and responsibility" as well as the other |
| 5040 |
two strings. Another example is given in the discussion of DEFINE |
two strings. Another example is given in the discussion of DEFINE |
| 5041 |
above. |
above. |
| 5042 |
|
|
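The difference between a back reference and a subroutine call can be approximated in Python's standard re module, which supports \1 but has no (?1). In this sketch the subroutine call is imitated by repeating the subpattern's text, which is an assumption made for illustration: it reproduces which strings match here, but not the atomic-group behaviour of a real PCRE subroutine call. The names STEM, backref, and subroutine_like are hypothetical.

```python
import re

STEM = r"(?:sens|respons)"   # hypothetical helper: the text of group 1

# \1 must repeat exactly what group 1 matched.
backref = re.compile(r"(sens|respons)e and \1ibility")

# Repeating the subpattern text stands in for the (?1) subroutine call:
# the second occurrence is matched afresh, independently of the first.
subroutine_like = re.compile(rf"({STEM})e and {STEM}ibility")
```

With these definitions, backref rejects "sense and responsibility" while subroutine_like accepts it, and both accept "sense and sensibility".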
| 5043 |
Like recursive subpatterns, a subroutine call is always treated as an |
Like recursive subpatterns, a subroutine call is always treated as an |
| 5044 |
atomic group. That is, once it has matched some of the subject string, |
atomic group. That is, once it has matched some of the subject string, |
| 5045 |
it is never re-entered, even if it contains untried alternatives and |
it is never re-entered, even if it contains untried alternatives and |
| 5046 |
there is a subsequent matching failure. Any capturing parentheses that |
there is a subsequent matching failure. Any capturing parentheses that |
| 5047 |
are set during the subroutine call revert to their previous values |
are set during the subroutine call revert to their previous values |
| 5048 |
afterwards. |
afterwards. |
| 5049 |
|
|
| 5050 |
When a subpattern is used as a subroutine, processing options such as |
When a subpattern is used as a subroutine, processing options such as |
| 5051 |
case-independence are fixed when the subpattern is defined. They cannot |
case-independence are fixed when the subpattern is defined. They cannot |
| 5052 |
be changed for different calls. For example, consider this pattern: |
be changed for different calls. For example, consider this pattern: |
| 5053 |
|
|
| 5054 |
(abc)(?i:(?-1)) |
(abc)(?i:(?-1)) |
| 5055 |
|
|
| 5056 |
It matches "abcabc". It does not match "abcABC" because the change of |
It matches "abcabc". It does not match "abcABC" because the change of |
| 5057 |
processing option does not affect the called subpattern. |
processing option does not affect the called subpattern. |
| 5058 |
|
|
| 5059 |
|
|
| 5060 |
ONIGURUMA SUBROUTINE SYNTAX |
ONIGURUMA SUBROUTINE SYNTAX |
| 5061 |
|
|
| 5062 |
For compatibility with Oniguruma, the non-Perl syntax \g followed by a |
For compatibility with Oniguruma, the non-Perl syntax \g followed by a |
| 5063 |
name or a number enclosed either in angle brackets or single quotes, is |
name or a number enclosed either in angle brackets or single quotes, is |
| 5064 |
an alternative syntax for referencing a subpattern as a subroutine, |
an alternative syntax for referencing a subpattern as a subroutine, |
| 5065 |
possibly recursively. Here are two of the examples used above, rewrit- |
possibly recursively. Here are two of the examples used above, rewrit- |
| 5066 |
ten using this syntax: |
ten using this syntax: |
| 5067 |
|
|
| 5068 |
(?<pn> \( ( (?>[^()]+) | \g<pn> )* \) ) |
(?<pn> \( ( (?>[^()]+) | \g<pn> )* \) ) |
| 5069 |
(sens|respons)e and \g'1'ibility |
(sens|respons)e and \g'1'ibility |
| 5070 |
|
|
| 5071 |
PCRE supports an extension to Oniguruma: if a number is preceded by a |
PCRE supports an extension to Oniguruma: if a number is preceded by a |
| 5072 |
plus or a minus sign it is taken as a relative reference. For example: |
plus or a minus sign it is taken as a relative reference. For example: |
| 5073 |
|
|
| 5074 |
(abc)(?i:\g<-1>) |
(abc)(?i:\g<-1>) |
| 5075 |
|
|
| 5076 |
Note that \g{...} (Perl syntax) and \g<...> (Oniguruma syntax) are not |
Note that \g{...} (Perl syntax) and \g<...> (Oniguruma syntax) are not |
| 5077 |
synonymous. The former is a back reference; the latter is a subroutine |
synonymous. The former is a back reference; the latter is a subroutine |
| 5078 |
call. |
call. |
| 5079 |
|
|
| 5080 |
|
|
| 5081 |
CALLOUTS |
CALLOUTS |
| 5082 |
|
|
| 5083 |
Perl has a feature whereby using the sequence (?{...}) causes arbitrary |
Perl has a feature whereby using the sequence (?{...}) causes arbitrary |
| 5084 |
Perl code to be obeyed in the middle of matching a regular expression. |
Perl code to be obeyed in the middle of matching a regular expression. |
| 5085 |
This makes it possible, amongst other things, to extract different sub- |
This makes it possible, amongst other things, to extract different sub- |
| 5086 |
strings that match the same pair of parentheses when there is a repeti- |
strings that match the same pair of parentheses when there is a repeti- |
| 5087 |
tion. |
tion. |
| 5088 |
|
|
| 5089 |
PCRE provides a similar feature, but of course it cannot obey arbitrary |
PCRE provides a similar feature, but of course it cannot obey arbitrary |
| 5090 |
Perl code. The feature is called "callout". The caller of PCRE provides |
Perl code. The feature is called "callout". The caller of PCRE provides |
| 5091 |
an external function by putting its entry point in the global variable |
an external function by putting its entry point in the global variable |
| 5092 |
pcre_callout. By default, this variable contains NULL, which disables |
pcre_callout. By default, this variable contains NULL, which disables |
| 5093 |
all calling out. |
all calling out. |
| 5094 |
|
|
| 5095 |
Within a regular expression, (?C) indicates the points at which the |
Within a regular expression, (?C) indicates the points at which the |
| 5096 |
external function is to be called. If you want to identify different |
external function is to be called. If you want to identify different |
| 5097 |
callout points, you can put a number less than 256 after the letter C. |
callout points, you can put a number less than 256 after the letter C. |
| 5098 |
The default value is zero. For example, this pattern has two callout |
The default value is zero. For example, this pattern has two callout |
| 5099 |
points: |
points: |
| 5100 |
|
|
| 5101 |
(?C1)abc(?C2)def |
(?C1)abc(?C2)def |
| 5102 |
|
|
| 5103 |
If the PCRE_AUTO_CALLOUT flag is passed to pcre_compile(), callouts are |
If the PCRE_AUTO_CALLOUT flag is passed to pcre_compile(), callouts are |
| 5104 |
automatically installed before each item in the pattern. They are all |
automatically installed before each item in the pattern. They are all |
| 5105 |
numbered 255. |
numbered 255. |
| 5106 |
|
|
| 5107 |
During matching, when PCRE reaches a callout point (and pcre_callout is |
During matching, when PCRE reaches a callout point (and pcre_callout is |
| 5108 |
set), the external function is called. It is provided with the number |
set), the external function is called. It is provided with the number |
| 5109 |
of the callout, the position in the pattern, and, optionally, one item |
of the callout, the position in the pattern, and, optionally, one item |
| 5110 |
of data originally supplied by the caller of pcre_exec(). The callout |
of data originally supplied by the caller of pcre_exec(). The callout |
| 5111 |
function may cause matching to proceed, to backtrack, or to fail alto- |
function may cause matching to proceed, to backtrack, or to fail alto- |
| 5112 |
gether. A complete description of the interface to the callout function |
gether. A complete description of the interface to the callout function |
| 5113 |
is given in the pcrecallout documentation. |
is given in the pcrecallout documentation. |
| 5114 |
|
|
| 5115 |
|
|
| 5116 |
BACKTRACKING CONTROL |
BACKTRACKING CONTROL |
| 5117 |
|
|
| 5118 |
Perl 5.10 introduced a number of "Special Backtracking Control Verbs", |
Perl 5.10 introduced a number of "Special Backtracking Control Verbs", |
| 5119 |
which are described in the Perl documentation as "experimental and sub- |
which are described in the Perl documentation as "experimental and sub- |
| 5120 |
ject to change or removal in a future version of Perl". It goes on to |
ject to change or removal in a future version of Perl". It goes on to |
| 5121 |
say: "Their usage in production code should be noted to avoid problems |
say: "Their usage in production code should be noted to avoid problems |
| 5122 |
during upgrades." The same remarks apply to the PCRE features described |
during upgrades." The same remarks apply to the PCRE features described |
| 5123 |
in this section. |
in this section. |
| 5124 |
|
|
| 5125 |
Since these verbs are specifically related to backtracking, most of |
Since these verbs are specifically related to backtracking, most of |
| 5126 |
them can be used only when the pattern is to be matched using |
them can be used only when the pattern is to be matched using |
| 5127 |
pcre_exec(), which uses a backtracking algorithm. With the exception of |
pcre_exec(), which uses a backtracking algorithm. With the exception of |
| 5128 |
(*FAIL), which behaves like a failing negative assertion, they cause an |
(*FAIL), which behaves like a failing negative assertion, they cause an |
| 5129 |
error if encountered by pcre_dfa_exec(). |
error if encountered by pcre_dfa_exec(). |
| 5130 |
|
|
| 5131 |
If any of these verbs are used in an assertion or subroutine subpattern |
If any of these verbs are used in an assertion or subroutine subpattern |
| 5132 |
(including recursive subpatterns), their effect is confined to that |
(including recursive subpatterns), their effect is confined to that |
| 5133 |
subpattern; it does not extend to the surrounding pattern. Note that |
subpattern; it does not extend to the surrounding pattern. Note that |
| 5134 |
such subpatterns are processed as anchored at the point where they are |
such subpatterns are processed as anchored at the point where they are |
| 5135 |
tested. |
tested. |
| 5136 |
|
|
| 5137 |
The new verbs make use of what was previously invalid syntax: an open- |
The new verbs make use of what was previously invalid syntax: an open- |
| 5138 |
ing parenthesis followed by an asterisk. In Perl, they are generally of |
ing parenthesis followed by an asterisk. In Perl, they are generally of |
| 5139 |
the form (*VERB:ARG) but PCRE does not support the use of arguments, so |
the form (*VERB:ARG) but PCRE does not support the use of arguments, so |
| 5140 |
its general form is just (*VERB). Any number of these verbs may occur |
its general form is just (*VERB). Any number of these verbs may occur |
| 5141 |
in a pattern. There are two kinds: |
in a pattern. There are two kinds: |
| 5142 |
|
|
| 5143 |
Verbs that act immediately |
Verbs that act immediately |
| 5146 |
|
|
| 5147 |
(*ACCEPT) |
(*ACCEPT) |
| 5148 |
|
|
| 5149 |
This verb causes the match to end successfully, skipping the remainder |
This verb causes the match to end successfully, skipping the remainder |
| 5150 |
of the pattern. When inside a recursion, only the innermost pattern is |
of the pattern. When inside a recursion, only the innermost pattern is |
| 5151 |
ended immediately. If (*ACCEPT) is inside capturing parentheses, the |
ended immediately. If (*ACCEPT) is inside capturing parentheses, the |
| 5152 |
data so far is captured. (This feature was added to PCRE at release |
data so far is captured. (This feature was added to PCRE at release |
| 5153 |
8.00.) For example: |
8.00.) For example: |
| 5154 |
|
|
| 5155 |
A((?:A|B(*ACCEPT)|C)D) |
A((?:A|B(*ACCEPT)|C)D) |
| 5156 |
|
|
| 5157 |
This matches "AB", "AAD", or "ACD"; when it matches "AB", "B" is cap- |
This matches "AB", "AAD", or "ACD"; when it matches "AB", "B" is cap- |
| 5158 |
tured by the outer parentheses. |
tured by the outer parentheses. |
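
The captures in the example above can be checked with an ACCEPT-free rewrite. This is only a sketch: Python's re module has no backtracking verbs, so the pattern below is a hypothetical hand-made equivalent that holds for this one pattern (nothing follows the outer group), not a general translation of (*ACCEPT).

```python
import re

# Hypothetical rewrite of A((?:A|B(*ACCEPT)|C)D) without (*ACCEPT).
# Assumption: equivalent only because nothing follows the capturing group.
rewritten = re.compile(r"A(B|(?:A|C)D)")

for subject in ("AB", "AAD", "ACD"):
    m = rewritten.match(subject)
    # group(0) is the whole match, group(1) the outer capturing parentheses
    print(subject, m.group(0), m.group(1))
# AB AB B
# AAD AAD AD
# ACD ACD CD
```

With the subject "AB" the group holds just "B", which is the data captured up to the point where (*ACCEPT) would end the match.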
| 5159 |
|
|
| 5160 |
(*FAIL) or (*F) |
(*FAIL) or (*F) |
| 5161 |
|
|
| 5162 |
This verb causes the match to fail, forcing backtracking to occur. It |
This verb causes the match to fail, forcing backtracking to occur. It |
| 5163 |
is equivalent to (?!) but easier to read. The Perl documentation notes |
is equivalent to (?!) but easier to read. The Perl documentation notes |
| 5164 |
that it is probably useful only when combined with (?{}) or (??{}). |
that it is probably useful only when combined with (?{}) or (??{}). |
| 5165 |
Those are, of course, Perl features that are not present in PCRE. The |
Those are, of course, Perl features that are not present in PCRE. The |
| 5166 |
nearest equivalent is the callout feature, as for example in this pat- |
nearest equivalent is the callout feature, as for example in this pat- |
| 5167 |
tern: |
tern: |
| 5168 |
|
|
| 5169 |
a+(?C)(*FAIL) |
a+(?C)(*FAIL) |
| 5170 |
|
|
| 5171 |
A match with the string "aaaa" always fails, but the callout is taken |
A match with the string "aaaa" always fails, but the callout is taken |
| 5172 |
before each backtrack happens (in this example, 10 times). |
before each backtrack happens (in this example, 10 times). |
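
The figure of 10 can be reproduced by counting. The sketch below does not call PCRE (Python's re module has neither callouts nor (*FAIL)); the helper count_callouts is a hypothetical name, and it assumes an unanchored match in which every starting offset is tried and no start-of-match optimization skips any of them.

```python
import re

def count_callouts(subject: str) -> int:
    """Count how often the callout in a+(?C)(*FAIL) would be reached.

    At each starting offset, a+ first takes the longest run of "a" and then
    gives back one character at a time. The callout follows a+, so it is
    reached once for every length that a+ can match.
    """
    total = 0
    for start in range(len(subject)):
        run = 0
        while start + run < len(subject) and subject[start + run] == "a":
            run += 1
        total += run  # lengths run, run-1, ..., 1 each reach the callout
    return total

print(count_callouts("aaaa"))        # 4 + 3 + 2 + 1 = 10
print(re.search(r"a+(?!)", "aaaa"))  # None: (?!) always fails, like (*FAIL)
```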
| 5173 |
|
|
| 5174 |
Verbs that act after backtracking |
Verbs that act after backtracking |
| 5175 |
|
|
| 5176 |
The following verbs do nothing when they are encountered. Matching con- |
The following verbs do nothing when they are encountered. Matching con- |
| 5177 |
tinues with what follows, but if there is no subsequent match, a fail- |
tinues with what follows, but if there is no subsequent match, a fail- |
| 5178 |
ure is forced. The verbs differ in exactly what kind of failure |
ure is forced. The verbs differ in exactly what kind of failure |
| 5179 |
occurs. |
occurs. |
| 5180 |
|
|
| 5181 |
(*COMMIT) |
(*COMMIT) |
| 5182 |
|
|
| 5183 |
This verb causes the whole match to fail outright if the rest of the |
This verb causes the whole match to fail outright if the rest of the |
| 5184 |
pattern does not match. Even if the pattern is unanchored, no further |
pattern does not match. Even if the pattern is unanchored, no further |
| 5185 |
attempts to find a match by advancing the starting point take place. |
attempts to find a match by advancing the starting point take place. |
| 5186 |
Once (*COMMIT) has been passed, pcre_exec() is committed to finding a |
Once (*COMMIT) has been passed, pcre_exec() is committed to finding a |
| 5187 |
match at the current starting point, or not at all. For example: |
match at the current starting point, or not at all. For example: |
| 5188 |
|
|
| 5189 |
a+(*COMMIT)b |
a+(*COMMIT)b |
| 5190 |
|
|
| 5191 |
This matches "xxaab" but not "aacaab". It can be thought of as a kind |
This matches "xxaab" but not "aacaab". It can be thought of as a kind |
| 5192 |
of dynamic anchor, or "I've started, so I must finish." |
of dynamic anchor, or "I've started, so I must finish." |
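
The two results above can be traced with a small hand-written simulation. This is a sketch of the behaviour described here, not PCRE's implementation: Python's re module does not support (*COMMIT), and search_a_plus_commit_b is a hypothetical helper hard-wired to this one pattern.

```python
def search_a_plus_commit_b(subject: str):
    """Simulate a+(*COMMIT)b; return (start, end) offsets or None."""
    for start in range(len(subject)):
        end = start
        while end < len(subject) and subject[end] == "a":
            end += 1
        if end == start:
            # a+ failed before (*COMMIT) was reached: normal bumpalong
            continue
        # (*COMMIT) has now been passed with the greedy match of a+.
        if end < len(subject) and subject[end] == "b":
            return (start, end + 1)
        # Backtracking would have to cross (*COMMIT), so the whole match
        # fails: no shorter a+ is tried and no later start point is tried.
        return None
    return None

print(search_a_plus_commit_b("xxaab"))   # (2, 5)
print(search_a_plus_commit_b("aacaab"))  # None
```

In "xxaab" the first two start points fail before (*COMMIT) is reached, so the bumpalong still happens; in "aacaab" the verb is passed at the very first start point, and the later "aab" is never considered.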
| 5193 |
|
|
| 5194 |
(*PRUNE) |
(*PRUNE) |
| 5195 |
|
|
| 5196 |
This verb causes the match to fail at the current position if the rest |
This verb causes the match to fail at the current position if the rest |
| 5197 |
of the pattern does not match. If the pattern is unanchored, the normal |
of the pattern does not match. If the pattern is unanchored, the normal |
| 5198 |
"bumpalong" advance to the next starting character then happens. Back- |
"bumpalong" advance to the next starting character then happens. Back- |
| 5199 |
tracking can occur as usual to the left of (*PRUNE), or when matching |
tracking can occur as usual to the left of (*PRUNE), or when matching |
| 5200 |
to the right of (*PRUNE), but if there is no match to the right, back- |
to the right of (*PRUNE), but if there is no match to the right, back- |
| 5201 |
tracking cannot cross (*PRUNE). In simple cases, the use of (*PRUNE) |
tracking cannot cross (*PRUNE). In simple cases, the use of (*PRUNE) |
| 5202 |
is just an alternative to an atomic group or possessive quantifier, but |
is just an alternative to an atomic group or possessive quantifier, but |
| 5203 |
there are some uses of (*PRUNE) that cannot be expressed in any other |
there are some uses of (*PRUNE) that cannot be expressed in any other |
| 5204 |
way. |
way. |
| 5205 |
|
|
| 5206 |
(*SKIP) |
(*SKIP) |
| 5207 |
|
|
| 5208 |
This verb is like (*PRUNE), except that if the pattern is unanchored, |
This verb is like (*PRUNE), except that if the pattern is unanchored, |
| 5209 |
the "bumpalong" advance is not to the next character, but to the posi- |
the "bumpalong" advance is not to the next character, but to the posi- |
| 5210 |
tion in the subject where (*SKIP) was encountered. (*SKIP) signifies |
tion in the subject where (*SKIP) was encountered. (*SKIP) signifies |
| 5211 |
that whatever text was matched leading up to it cannot be part of a |
that whatever text was matched leading up to it cannot be part of a |
| 5212 |
successful match. Consider: |
successful match. Consider: |
| 5213 |
|
|
| 5214 |
a+(*SKIP)b |
a+(*SKIP)b |
| 5215 |
|
|
| 5216 |
If the subject is "aaaac...", after the first match attempt fails |
If the subject is "aaaac...", after the first match attempt fails |
| 5217 |
(starting at the first character in the string), the starting point |
(starting at the first character in the string), the starting point |
| 5218 |
skips on to start the next attempt at "c". Note that a possessive quan- |
skips on to start the next attempt at "c". Note that a possessive quan- |
| 5219 |
tifier does not have the same effect as this example; although it would |
tifier does not have the same effect as this example; although it would |
| 5220 |
suppress backtracking during the first match attempt, the second |
suppress backtracking during the first match attempt, the second |
| 5221 |
attempt would start at the second character instead of skipping on to |
attempt would start at the second character instead of skipping on to |
| 5222 |
"c". |
"c". |
| 5223 |
|
|
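
The difference between a+(*SKIP)b and the possessive form a++b lies in which starting offsets are tried. The sketch below records them for both patterns. It is a hand-written simulation rather than a call into PCRE; start_points is a hypothetical helper, and start-of-match optimizations are ignored.

```python
def start_points(subject: str, use_skip: bool) -> list[int]:
    """Starting offsets tried for a+(*SKIP)b (use_skip) or a++b (otherwise)."""
    tried = []
    start = 0
    while start < len(subject):
        tried.append(start)
        end = start
        while end < len(subject) and subject[end] == "a":
            end += 1
        # Neither form may give back characters of a+, so only the greedy
        # length is tested against the following "b".
        if end > start and end < len(subject) and subject[end] == "b":
            break  # match found, stop searching
        if use_skip and end > start:
            start = end   # (*SKIP) was reached: resume where it was met
        else:
            start += 1    # ordinary bumpalong by one character
    return tried

print(start_points("aaaac", use_skip=True))   # [0, 4]
print(start_points("aaaac", use_skip=False))  # [0, 1, 2, 3, 4]
```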
| 5224 |
(*THEN) |
(*THEN) |
| 5225 |
|
|
| 5226 |
This verb causes a skip to the next alternation if the rest of the pat- |
This verb causes a skip to the next alternation if the rest of the pat- |
| 5227 |
tern does not match. That is, it cancels pending backtracking, but only |
tern does not match. That is, it cancels pending backtracking, but only |
| 5228 |
within the current alternation. Its name comes from the observation |
within the current alternation. Its name comes from the observation |
| 5229 |
that it can be used for a pattern-based if-then-else block: |
that it can be used for a pattern-based if-then-else block: |
| 5230 |
|
|
| 5231 |
( COND1 (*THEN) FOO | COND2 (*THEN) BAR | COND3 (*THEN) BAZ ) ... |
( COND1 (*THEN) FOO | COND2 (*THEN) BAR | COND3 (*THEN) BAZ ) ... |
| 5232 |
|
|
| 5233 |
If the COND1 pattern matches, FOO is tried (and possibly further items |
If the COND1 pattern matches, FOO is tried (and possibly further items |
| 5234 |
after the end of the group if FOO succeeds); on failure the matcher |
after the end of the group if FOO succeeds); on failure the matcher |
| 5235 |
skips to the second alternative and tries COND2, without backtracking |
skips to the second alternative and tries COND2, without backtracking |
| 5236 |
into COND1. If (*THEN) is used outside of any alternation, it acts |
into COND1. If (*THEN) is used outside of any alternation, it acts |
| 5237 |
exactly like (*PRUNE). |
exactly like (*PRUNE). |
| 5238 |
|
|
| 5239 |
|
|
| 5251 |
|
|
| 5252 |
REVISION |
REVISION |
| 5253 |
|
|
| 5254 |
Last updated: 18 October 2009 |
Last updated: 11 January 2010 |
| 5255 |
Copyright (c) 1997-2009 University of Cambridge. |
Copyright (c) 1997-2010 University of Cambridge. |
| 5256 |
------------------------------------------------------------------------------ |
------------------------------------------------------------------------------ |
| 5257 |
|
|
| 5258 |
|
|