| 268 |
\t tab (hex 09) |
\t tab (hex 09) |
| 269 |
\ddd character with octal code ddd, or back reference |
\ddd character with octal code ddd, or back reference |
| 270 |
\xhh character with hex code hh |
\xhh character with hex code hh |
| 271 |
\x{hhh..} character with hex code hhh.. (non-JavaScript mode) |
\x{hhh..} character with hex code hhh.. |
|
\uhhhh character with hex code hhhh (JavaScript mode only) |
|
| 272 |
</pre> |
</pre> |
| 273 |
The precise effect of \cx is as follows: if x is a lower case letter, it |
The precise effect of \cx is as follows: if x is a lower case letter, it |
| 274 |
is converted to upper case. Then bit 6 of the character (hex 40) is inverted. |
is converted to upper case. Then bit 6 of the character (hex 40) is inverted. |
| 280 |
0xc0 bits are flipped.) |
0xc0 bits are flipped.) |
| 281 |
</P> |
</P> |
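As an illustration of the \cx rule stated above, here is a minimal Python sketch. It assumes an ASCII environment only; the function name control_escape is hypothetical and is not part of PCRE.

```python
def control_escape(x: str) -> int:
    # Illustrative sketch only (ASCII assumed); not PCRE's own code.
    if len(x) != 1 or ord(x) > 0x7F:
        raise ValueError("this sketch handles a single ASCII character only")
    if "a" <= x <= "z":
        x = x.upper()        # a lower case letter is first converted to upper case
    return ord(x) ^ 0x40     # then bit 6 of the character (hex 40) is inverted


if __name__ == "__main__":
    for ch in ("z", "A", "{", ";"):
        print(ch, hex(control_escape(ch)))
```

Because the operation is a single bit inversion, letters map to the control codes hex 01 to hex 1A, while a character such as ";" (hex 3B) maps upwards to hex 7B.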
| 282 |
<P> |
<P> |
| 283 |
By default, after \x, from zero to two hexadecimal digits are read (letters |
After \x, from zero to two hexadecimal digits are read (letters can be in |
| 284 |
can be in upper or lower case). Any number of hexadecimal digits may appear |
upper or lower case). Any number of hexadecimal digits may appear between \x{ |
| 285 |
between \x{ and }, but the value of the character code must be less than 256 |
and }, but the value of the character code must be less than 256 in non-UTF-8 |
| 286 |
in non-UTF-8 mode, and less than 2**31 in UTF-8 mode. That is, the maximum |
mode, and less than 2**31 in UTF-8 mode. That is, the maximum value in |
| 287 |
value in hexadecimal is 7FFFFFFF. Note that this is bigger than the largest |
hexadecimal is 7FFFFFFF. Note that this is bigger than the largest Unicode code |
| 288 |
Unicode code point, which is 10FFFF. |
point, which is 10FFFF. |
| 289 |
</P> |
</P> |
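The default reading of \x described above can be sketched in Python as follows. This is an assumption-laden illustration, not PCRE's implementation: the name parse_hex_escape, the utf8 flag, and the use of ValueError for an out-of-range value are all hypothetical.

```python
import re

# Any number of hexadecimal digits between { and } (at least one is assumed here).
_BRACED = re.compile(r"\{([0-9A-Fa-f]+)\}")
# From zero to two hexadecimal digits, in upper or lower case.
_SHORT = re.compile(r"[0-9A-Fa-f]{0,2}")


def parse_hex_escape(rest: str, utf8: bool = False):
    """Decode the text that follows a backslash-x.

    Returns (character code, number of characters consumed from rest).
    """
    limit = 2**31 if utf8 else 256   # less than 256, or less than 2**31 in UTF-8 mode
    m = _BRACED.match(rest)
    if m:
        code = int(m.group(1), 16)
        if code >= limit:
            raise ValueError("character code too large for this mode")
        return code, m.end()
    # Basic form: read zero, one, or two hexadecimal digits.
    digits = _SHORT.match(rest).group(0)
    return (int(digits, 16) if digits else 0), len(digits)


if __name__ == "__main__":
    print(parse_hex_escape("dc"))
    print(parse_hex_escape("{dc}"))
    print(parse_hex_escape("{7FFFFFFF}", utf8=True))
```

In this sketch a malformed braced form falls through to the basic form, which then reads no digits and yields the value zero.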
| 290 |
<P> |
<P> |
| 291 |
If characters other than hexadecimal digits appear between \x{ and }, or if |
If characters other than hexadecimal digits appear between \x{ and }, or if |
| 294 |
following digits, giving a character whose value is zero. |
following digits, giving a character whose value is zero. |
| 295 |
</P> |
</P> |
| 296 |
<P> |
<P> |
|
If the PCRE_JAVASCRIPT_COMPAT option is set, the interpretation of \x is |
|
|
as just described only when it is followed by two hexadecimal digits. |
|
|
Otherwise, it matches a literal "x" character. In JavaScript mode, support for |
|
|
code points greater than 255 is provided by \u, which must be followed by |
|
|
four hexadecimal digits; otherwise it matches a literal "u" character. |
|
|
</P> |
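The JavaScript-mode rule in the preceding paragraph can be sketched in Python as follows. The function name parse_js_escape and its return convention are hypothetical; the sketch only mirrors the rule as worded there and makes no claim about how PCRE implements it.

```python
import re

_TWO_HEX = re.compile(r"[0-9A-Fa-f]{2}")
_FOUR_HEX = re.compile(r"[0-9A-Fa-f]{4}")


def parse_js_escape(letter: str, rest: str):
    """Decode backslash-x or backslash-u as worded for JavaScript mode.

    letter is "x" or "u"; rest is the pattern text after that letter.
    Returns (character code, number of characters consumed from rest).
    """
    if letter not in ("x", "u"):
        raise ValueError("only x and u are handled by this sketch")
    m = (_TWO_HEX if letter == "x" else _FOUR_HEX).match(rest)
    if m is None:
        # Not followed by enough hexadecimal digits: a literal "x" or "u".
        return ord(letter), 0
    return int(m.group(0), 16), m.end()


if __name__ == "__main__":
    print(parse_js_escape("x", "dc"))
    print(parse_js_escape("u", "00dc"))
    print(parse_js_escape("x", "{dc}"))
```

Note that in this mode the braced form is not special: the text "{dc}" after \x leaves a literal "x" followed by ordinary pattern characters.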
|
|
<P> |
|
| 297 |
Characters whose value is less than 256 can be defined by either of the two |
Characters whose value is less than 256 can be defined by either of the two |
| 298 |
syntaxes for \x (or by \u in JavaScript mode). There is no difference in the |
syntaxes for \x. There is no difference in the way they are handled. For |
| 299 |
way they are handled. For example, \xdc is exactly the same as \x{dc} (or |
example, \xdc is exactly the same as \x{dc}. |
|
\u00dc in JavaScript mode). |
|
| 300 |
</P> |
</P> |
| 301 |
<P> |
<P> |
| 302 |
After \0 up to two further octal digits are read. If there are fewer than two |
After \0 up to two further octal digits are read. If there are fewer than two |
| 338 |
</P> |
</P> |
| 339 |
<P> |
<P> |
| 340 |
All the sequences that define a single character value can be used both inside |
All the sequences that define a single character value can be used both inside |
| 341 |
and outside character classes. In addition, inside a character class, \b is |
and outside character classes. In addition, inside a character class, the |
| 342 |
interpreted as the backspace character (hex 08). |
sequence \b is interpreted as the backspace character (hex 08). The sequences |
| 343 |
</P> |
\B, \N, \R, and \X are not special inside a character class. Like any other |
| 344 |
<P> |
unrecognized escape sequences, they are treated as the literal characters "B", |
| 345 |
\N is not allowed in a character class. \B, \R, and \X are not special |
"N", "R", and "X" by default, but cause an error if the PCRE_EXTRA option is |
| 346 |
inside a character class. Like other unrecognized escape sequences, they are |
set. Outside a character class, these sequences have different meanings. |
|
treated as the literal characters "B", "R", and "X" by default, but cause an |
|
|
error if the PCRE_EXTRA option is set. Outside a character class, these |
|
|
sequences have different meanings. |
|
|
</P> |
|
|
<br><b> |
|
|
Unsupported escape sequences |
|
|
</b><br> |
|
|
<P> |
|
|
In Perl, the sequences \l, \L, \u, and \U are recognized by its string |
|
|
handler and used to modify the case of following characters. By default, PCRE |
|
|
does not support these escape sequences. However, if the PCRE_JAVASCRIPT_COMPAT |
|
|
option is set, \U matches a "U" character, and \u can be used to define a |
|
|
character by code point, as described in the previous section. |
|
| 347 |
</P> |
</P> |
| 348 |
<br><b> |
<br><b> |
| 349 |
Absolute and relative back references |
Absolute and relative back references |
| 389 |
There is also the single sequence \N, which matches a non-newline character. |
There is also the single sequence \N, which matches a non-newline character. |
| 390 |
This is the same as |
This is the same as |
| 391 |
<a href="#fullstopdot">the "." metacharacter</a> |
<a href="#fullstopdot">the "." metacharacter</a> |
| 392 |
when PCRE_DOTALL is not set. Perl also uses \N to match characters by name; |
when PCRE_DOTALL is not set. |
|
PCRE does not support this. |
|
| 393 |
</P> |
</P> |
| 394 |
<P> |
<P> |
| 395 |
Each pair of lower and upper case escape sequences partitions the complete set |
Each pair of lower and upper case escape sequences partitions the complete set |
| 963 |
<P> |
<P> |
| 964 |
The escape sequence \N behaves like a dot, except that it is not affected by |
The escape sequence \N behaves like a dot, except that it is not affected by |
| 965 |
the PCRE_DOTALL option. In other words, it matches any character except one |
the PCRE_DOTALL option. In other words, it matches any character except one |
| 966 |
that signifies the end of a line. Perl also uses \N to match characters by |
that signifies the end of a line. |
|
name; PCRE does not support this. |
|
| 967 |
</P> |
</P> |
| 968 |
<br><a name="SEC7" href="#TOC1">MATCHING A SINGLE BYTE</a><br> |
<br><a name="SEC7" href="#TOC1">MATCHING A SINGLE BYTE</a><br> |
| 969 |
<P> |
<P> |
| 979 |
</P> |
</P> |
| 980 |
<P> |
<P> |
| 981 |
PCRE does not allow \C to appear in lookbehind assertions |
PCRE does not allow \C to appear in lookbehind assertions |
| 982 |
<a href="#lookbehind">(described below)</a> |
<a href="#lookbehind">(described below),</a> |
| 983 |
in UTF-8 mode, because this would make it impossible to calculate the length of |
because in UTF-8 mode this would make it impossible to calculate the length of |
| 984 |
the lookbehind. |
the lookbehind. |
| 985 |
</P> |
</P> |
| 986 |
<P> |
<P> |
| 1926 |
assertion fails. |
assertion fails. |
| 1927 |
</P> |
</P> |
| 1928 |
<P> |
<P> |
| 1929 |
In UTF-8 mode, PCRE does not allow the \C escape (which matches a single byte, |
PCRE does not allow the \C escape (which matches a single byte in UTF-8 mode) |
| 1930 |
even in UTF-8 mode) to appear in lookbehind assertions, because it makes it |
to appear in lookbehind assertions, because it makes it impossible to calculate |
| 1931 |
impossible to calculate the length of the lookbehind. The \X and \R escapes, |
the length of the lookbehind. The \X and \R escapes, which can match |
| 1932 |
which can match different numbers of bytes, are also not permitted. |
different numbers of bytes, are also not permitted. |
| 1933 |
</P> |
</P> |
| 1934 |
<P> |
<P> |
| 1935 |
<a href="#subpatternsassubroutines">"Subroutine"</a> |
<a href="#subpatternsassubroutines">"Subroutine"</a> |
| 2511 |
If any of these verbs are used in an assertion or in a subpattern that is |
If any of these verbs are used in an assertion or in a subpattern that is |
| 2512 |
called as a subroutine (whether or not recursively), their effect is confined |
called as a subroutine (whether or not recursively), their effect is confined |
| 2513 |
to that subpattern; it does not extend to the surrounding pattern, with one |
to that subpattern; it does not extend to the surrounding pattern, with one |
| 2514 |
exception: the name from a (*MARK), (*PRUNE), or (*THEN) that is encountered in |
exception: a *MARK that is encountered in a positive assertion <i>is</i> passed |
| 2515 |
a successful positive assertion <i>is</i> passed back when a match succeeds |
back (compare capturing parentheses in assertions). Note that such subpatterns |
| 2516 |
(compare capturing parentheses in assertions). Note that such subpatterns are |
are processed as anchored at the point where they are tested. Note also that |
| 2517 |
processed as anchored at the point where they are tested. Note also that Perl's |
Perl's treatment of subroutines is different in some cases. |
|
treatment of subroutines is different in some cases. |
|
| 2518 |
</P> |
</P> |
| 2519 |
<P> |
<P> |
| 2520 |
The new verbs make use of what was previously invalid syntax: an opening |
The new verbs make use of what was previously invalid syntax: an opening |
| 2536 |
when calling <b>pcre_compile()</b> or <b>pcre_exec()</b>, or by starting the |
when calling <b>pcre_compile()</b> or <b>pcre_exec()</b>, or by starting the |
| 2537 |
pattern with (*NO_START_OPT). |
pattern with (*NO_START_OPT). |
| 2538 |
</P> |
</P> |
|
<P> |
|
|
Experiments with Perl suggest that it too has similar optimizations, sometimes |
|
|
leading to anomalous results. |
|
|
</P> |
|
| 2539 |
<br><b> |
<br><b> |
| 2540 |
Verbs that act immediately |
Verbs that act immediately |
| 2541 |
</b><br> |
</b><br> |
| 2583 |
(*MARK) as you like in a pattern, and their names do not have to be unique. |
(*MARK) as you like in a pattern, and their names do not have to be unique. |
| 2584 |
</P> |
</P> |
| 2585 |
<P> |
<P> |
| 2586 |
When a match succeeds, the name of the last-encountered (*MARK) on the matching |
When a match succeeds, the name of the last-encountered (*MARK) is passed back |
| 2587 |
path is passed back to the caller via the <i>pcre_extra</i> data structure, as |
to the caller via the <i>pcre_extra</i> data structure, as described in the |
|
described in the |
|
| 2588 |
<a href="pcreapi.html#extradata">section on <i>pcre_extra</i></a> |
<a href="pcreapi.html#extradata">section on <i>pcre_extra</i></a> |
| 2589 |
in the |
in the |
| 2590 |
<a href="pcreapi.html"><b>pcreapi</b></a> |
<a href="pcreapi.html"><b>pcreapi</b></a> |
| 2591 |
documentation. Here is an example of <b>pcretest</b> output, where the /K |
documentation. No data is returned for a partial match. Here is an example of |
| 2592 |
modifier requests the retrieval and outputting of (*MARK) data: |
<b>pcretest</b> output, where the /K modifier requests the retrieval and |
| 2593 |
|
outputting of (*MARK) data: |
| 2594 |
<pre> |
<pre> |
| 2595 |
re> /X(*MARK:A)Y|X(*MARK:B)Z/K |
/X(*MARK:A)Y|X(*MARK:B)Z/K |
| 2596 |
data> XY |
XY |
| 2597 |
0: XY |
0: XY |
| 2598 |
MK: A |
MK: A |
| 2599 |
XZ |
XZ |
| 2611 |
assertions. |
assertions. |
| 2612 |
</P> |
</P> |
| 2613 |
<P> |
<P> |
| 2614 |
After a partial match or a failed match, the name of the last encountered |
A name may also be returned after a failed match if the final path through the |
| 2615 |
(*MARK) in the entire match process is returned. For example: |
pattern involves (*MARK). However, unless (*MARK) is used in conjunction with |
| 2616 |
|
(*COMMIT), this is unlikely to happen for an unanchored pattern because, as the |
| 2617 |
|
starting point for matching is advanced, the final check is often with an empty |
| 2618 |
|
string, causing a failure before (*MARK) is reached. For example: |
| 2619 |
<pre> |
<pre> |
| 2620 |
re> /X(*MARK:A)Y|X(*MARK:B)Z/K |
/X(*MARK:A)Y|X(*MARK:B)Z/K |
| 2621 |
data> XP |
XP |
| 2622 |
|
No match |
| 2623 |
|
</pre> |
| 2624 |
|
There are three potential starting points for this match (starting with X, |
| 2625 |
|
starting with P, and with an empty string). If the pattern is anchored, the |
| 2626 |
|
result is different: |
| 2627 |
|
<pre> |
| 2628 |
|
/^X(*MARK:A)Y|^X(*MARK:B)Z/K |
| 2629 |
|
XP |
| 2630 |
No match, mark = B |
No match, mark = B |
| 2631 |
</pre> |
</pre> |
| 2632 |
Note that in this unanchored example the mark is retained from the match |
PCRE's start-of-match optimizations can also interfere with this. For example, |
| 2633 |
attempt that started at the letter "X". Subsequent match attempts starting at |
if, as a result of a call to <b>pcre_study()</b>, it knows the minimum |
| 2634 |
"P" and then with an empty string do not get as far as the (*MARK) item, but |
subject length for a match, a shorter subject will not be scanned at all. |
| 2635 |
nevertheless do not reset it. |
</P> |
| 2636 |
|
<P> |
| 2637 |
|
Note that similar anomalies (though different in detail) exist in Perl, no |
| 2638 |
|
doubt for the same reasons. The use of (*MARK) data after a failed match of an |
| 2639 |
|
unanchored pattern is not recommended, unless (*COMMIT) is involved. |
| 2640 |
</P> |
</P> |
| 2641 |
<br><b> |
<br><b> |
| 2642 |
Verbs that act after backtracking |
Verbs that act after backtracking |
| 2675 |
unless PCRE's start-of-match optimizations are turned off, as shown in this |
unless PCRE's start-of-match optimizations are turned off, as shown in this |
| 2676 |
<b>pcretest</b> example: |
<b>pcretest</b> example: |
| 2677 |
<pre> |
<pre> |
| 2678 |
re> /(*COMMIT)abc/ |
/(*COMMIT)abc/ |
| 2679 |
data> xyzabc |
xyzabc |
| 2680 |
0: abc |
0: abc |
| 2681 |
xyzabc\Y |
xyzabc\Y |
| 2682 |
No match |
No match |
| 2697 |
the right, backtracking cannot cross (*PRUNE). In simple cases, the use of |
the right, backtracking cannot cross (*PRUNE). In simple cases, the use of |
| 2698 |
(*PRUNE) is just an alternative to an atomic group or possessive quantifier, |
(*PRUNE) is just an alternative to an atomic group or possessive quantifier, |
| 2699 |
but there are some uses of (*PRUNE) that cannot be expressed in any other way. |
but there are some uses of (*PRUNE) that cannot be expressed in any other way. |
| 2700 |
The behaviour of (*PRUNE:NAME) is the same as (*MARK:NAME)(*PRUNE). In an |
The behaviour of (*PRUNE:NAME) is the same as (*MARK:NAME)(*PRUNE) when the |
| 2701 |
anchored pattern (*PRUNE) has the same effect as (*COMMIT). |
match fails completely; the name is passed back if this is the final attempt. |
| 2702 |
|
(*PRUNE:NAME) does not pass back a name if the match succeeds. In an anchored |
| 2703 |
|
pattern (*PRUNE) has the same effect as (*COMMIT). |
| 2704 |
<pre> |
<pre> |
| 2705 |
(*SKIP) |
(*SKIP) |
| 2706 |
</pre> |
</pre> |
| 2726 |
searched for the most recent (*MARK) that has the same name. If one is found, |
searched for the most recent (*MARK) that has the same name. If one is found, |
| 2727 |
the "bumpalong" advance is to the subject position that corresponds to that |
the "bumpalong" advance is to the subject position that corresponds to that |
| 2728 |
(*MARK) instead of to where (*SKIP) was encountered. If no (*MARK) with a |
(*MARK) instead of to where (*SKIP) was encountered. If no (*MARK) with a |
| 2729 |
matching name is found, the (*SKIP) is ignored. |
matching name is found, normal "bumpalong" of one character happens (that is, |
| 2730 |
|
the (*SKIP) is ignored). |
| 2731 |
<pre> |
<pre> |
| 2732 |
(*THEN) or (*THEN:NAME) |
(*THEN) or (*THEN:NAME) |
| 2733 |
</pre> |
</pre> |
| 2741 |
If the COND1 pattern matches, FOO is tried (and possibly further items after |
If the COND1 pattern matches, FOO is tried (and possibly further items after |
| 2742 |
the end of the group if FOO succeeds); on failure, the matcher skips to the |
the end of the group if FOO succeeds); on failure, the matcher skips to the |
| 2743 |
second alternative and tries COND2, without backtracking into COND1. The |
second alternative and tries COND2, without backtracking into COND1. The |
| 2744 |
behaviour of (*THEN:NAME) is exactly the same as (*MARK:NAME)(*THEN). |
behaviour of (*THEN:NAME) is exactly the same as (*MARK:NAME)(*THEN) if the |
| 2745 |
If (*THEN) is not inside an alternation, it acts like (*PRUNE). |
overall match fails. If (*THEN) is not inside an alternation, it acts like |
| 2746 |
|
(*PRUNE). |
| 2747 |
</P> |
</P> |
| 2748 |
<P> |
<P> |
| 2749 |
Note that a subpattern that does not contain a | character is just a part of |
Note that a subpattern that does not contain a | character is just a part of |
| 2819 |
</P> |
</P> |
| 2820 |
<br><a name="SEC28" href="#TOC1">REVISION</a><br> |
<br><a name="SEC28" href="#TOC1">REVISION</a><br> |
| 2821 |
<P> |
<P> |
| 2822 |
Last updated: 29 November 2011 |
Last updated: 19 October 2011 |
| 2823 |
<br> |
<br> |
| 2824 |
Copyright © 1997-2011 University of Cambridge. |
Copyright © 1997-2011 University of Cambridge. |
| 2825 |
<br> |
<br> |