| 268 |
\t tab (hex 09) |
\t tab (hex 09) |
| 269 |
\ddd character with octal code ddd, or back reference |
\ddd character with octal code ddd, or back reference |
| 270 |
\xhh character with hex code hh |
\xhh character with hex code hh |
| 271 |
\x{hhh..} character with hex code hhh.. |
\x{hhh..} character with hex code hhh.. (non-JavaScript mode) |
| 272 |
|
\uhhhh character with hex code hhhh (JavaScript mode only) |
| 273 |
</pre> |
</pre> |
| 274 |
The precise effect of \cx is as follows: if x is a lower case letter, it |
The precise effect of \cx is as follows: if x is a lower case letter, it |
| 275 |
is converted to upper case. Then bit 6 of the character (hex 40) is inverted. |
is converted to upper case. Then bit 6 of the character (hex 40) is inverted. |
| 281 |
0xc0 bits are flipped.) |
0xc0 bits are flipped.) |
| 282 |
</P> |
</P> |
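The \cx rule above is simple bit arithmetic. The following is a minimal sketch of it, assuming an ASCII character; the function name `control_escape` is hypothetical and is not part of PCRE.

```python
def control_escape(x: str) -> int:
    """Return the character code produced by \\cx for an ASCII character x.

    Illustrative sketch only: PCRE does this internally, and this helper
    is not part of its API.
    """
    code = ord(x)
    # Step 1: a lower case letter is converted to upper case.
    if "a" <= x <= "z":
        code = ord(x.upper())
    # Step 2: bit 6 of the character (hex 40) is inverted.
    return code ^ 0x40


if __name__ == "__main__":
    for ch in ("z", "{", ";"):
        print(f"\\c{ch} -> hex {control_escape(ch):02X}")
```

With this rule, a letter maps to a control code (z, hex 7A, becomes hex 1A), while a non-letter such as { (hex 7B) only has the one bit flipped and becomes hex 3B.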
| 283 |
<P> |
<P> |
| 284 |
After \x, from zero to two hexadecimal digits are read (letters can be in |
By default, after \x, from zero to two hexadecimal digits are read (letters |
| 285 |
upper or lower case). Any number of hexadecimal digits may appear between \x{ |
can be in upper or lower case). Any number of hexadecimal digits may appear |
| 286 |
and }, but the value of the character code must be less than 256 in non-UTF-8 |
between \x{ and }, but the value of the character code must be less than 256 |
| 287 |
mode, and less than 2**31 in UTF-8 mode. That is, the maximum value in |
in non-UTF-8 mode, and less than 2**31 in UTF-8 mode. That is, the maximum |
| 288 |
hexadecimal is 7FFFFFFF. Note that this is bigger than the largest Unicode code |
value in hexadecimal is 7FFFFFFF. Note that this is bigger than the largest |
| 289 |
point, which is 10FFFF. |
Unicode code point, which is 10FFFF. |
| 290 |
</P> |
</P> |
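The limits on \x{hhh..} described above can be expressed as a small range check. This is a sketch under the stated limits only; `check_braced_hex` and its `utf8` parameter are hypothetical names for illustration, not PCRE functions.

```python
def check_braced_hex(digits: str, utf8: bool) -> int:
    """Validate the hexadecimal digits found between \\x{ and }.

    Illustrative only. The value must be less than 256 in non-UTF-8 mode
    and less than 2**31 in UTF-8 mode.
    """
    value = int(digits, 16)          # letters may be upper or lower case
    limit = 2**31 if utf8 else 256
    if value >= limit:
        raise ValueError(f"character value {value:#x} is too large")
    return value


if __name__ == "__main__":
    print(hex(check_braced_hex("dc", utf8=False)))
    print(hex(check_braced_hex("7FFFFFFF", utf8=True)))
```

The UTF-8 limit is deliberately wider than Unicode itself: hex 7FFFFFFF passes the check even though the largest Unicode code point is hex 10FFFF.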
| 291 |
<P> |
<P> |
| 292 |
If characters other than hexadecimal digits appear between \x{ and }, or if |
If characters other than hexadecimal digits appear between \x{ and }, or if |
| 295 |
following digits, giving a character whose value is zero. |
following digits, giving a character whose value is zero. |
| 296 |
</P> |
</P> |
| 297 |
<P> |
<P> |
| 298 |
|
If the PCRE_JAVASCRIPT_COMPAT option is set, the interpretation of \x is |
| 299 |
|
as just described only when it is followed by two hexadecimal digits. |
| 300 |
|
Otherwise, it matches a literal "x" character. In JavaScript mode, support for |
| 301 |
|
code points greater than 256 is provided by \u, which must be followed by |
| 302 |
|
four hexadecimal digits; otherwise it matches a literal "u" character. |
| 303 |
|
</P> |
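The JavaScript-mode rule for \x and \u can be sketched as a tiny parser step. This is an illustration of the rule as described here, not PCRE's actual code; the function `js_escape` and its calling convention are assumptions made for the example.

```python
import string


def js_escape(pattern: str, pos: int) -> tuple[str, int]:
    """Interpret \\x or \\u as described for PCRE_JAVASCRIPT_COMPAT.

    `pos` is the index of the letter that follows the backslash.
    Returns the resulting character and the index of the next unread
    character. Hypothetical helper, for illustration only.
    """
    letter = pattern[pos]
    width = {"x": 2, "u": 4}[letter]  # \x needs two digits, \u needs four
    digits = pattern[pos + 1 : pos + 1 + width]
    if len(digits) == width and all(c in string.hexdigits for c in digits):
        return chr(int(digits, 16)), pos + 1 + width
    # Not followed by the required digits: a literal "x" or "u".
    return letter, pos + 1


if __name__ == "__main__":
    for pat in (r"\xdc", r"\xg", r"\u00dc", r"\u00d"):
        print(pat, js_escape(pat, 1))
```

Note that the fallback consumes only the letter itself, so any digits that follow an incomplete \x or \u are matched as ordinary literal characters.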
| 304 |
|
<P> |
| 305 |
Characters whose value is less than 256 can be defined by either of the two |
Characters whose value is less than 256 can be defined by either of the two |
| 306 |
syntaxes for \x. There is no difference in the way they are handled. For |
syntaxes for \x (or by \u in JavaScript mode). There is no difference in the |
| 307 |
example, \xdc is exactly the same as \x{dc}. |
way they are handled. For example, \xdc is exactly the same as \x{dc} (or |
| 308 |
|
\u00dc in JavaScript mode). |
| 309 |
</P> |
</P> |
| 310 |
<P> |
<P> |
| 311 |
After \0 up to two further octal digits are read. If there are fewer than two |
After \0 up to two further octal digits are read. If there are fewer than two |
| 347 |
</P> |
</P> |
| 348 |
<P> |
<P> |
| 349 |
All the sequences that define a single character value can be used both inside |
All the sequences that define a single character value can be used both inside |
| 350 |
and outside character classes. In addition, inside a character class, the |
and outside character classes. In addition, inside a character class, \b is |
| 351 |
sequence \b is interpreted as the backspace character (hex 08). The sequences |
interpreted as the backspace character (hex 08). |
| 352 |
\B, \N, \R, and \X are not special inside a character class. Like any other |
</P> |
| 353 |
unrecognized escape sequences, they are treated as the literal characters "B", |
<P> |
| 354 |
"N", "R", and "X" by default, but cause an error if the PCRE_EXTRA option is |
\N is not allowed in a character class. \B, \R, and \X are not special |
| 355 |
set. Outside a character class, these sequences have different meanings. |
inside a character class. Like other unrecognized escape sequences, they are |
| 356 |
|
treated as the literal characters "B", "R", and "X" by default, but cause an |
| 357 |
|
error if the PCRE_EXTRA option is set. Outside a character class, these |
| 358 |
|
sequences have different meanings. |
| 359 |
|
</P> |
| 360 |
|
<br><b> |
| 361 |
|
Unsupported escape sequences |
| 362 |
|
</b><br> |
| 363 |
|
<P> |
| 364 |
|
In Perl, the sequences \l, \L, \u, and \U are recognized by its string |
| 365 |
|
handler and used to modify the case of following characters. By default, PCRE |
| 366 |
|
does not support these escape sequences. However, if the PCRE_JAVASCRIPT_COMPAT |
| 367 |
|
option is set, \U matches a "U" character, and \u can be used to define a |
| 368 |
|
character by code point, as described in the previous section. |
| 369 |
</P> |
</P> |
| 370 |
<br><b> |
<br><b> |
| 371 |
Absolute and relative back references |
Absolute and relative back references |
| 411 |
There is also the single sequence \N, which matches a non-newline character. |
There is also the single sequence \N, which matches a non-newline character. |
| 412 |
This is the same as |
This is the same as |
| 413 |
<a href="#fullstopdot">the "." metacharacter</a> |
<a href="#fullstopdot">the "." metacharacter</a> |
| 414 |
when PCRE_DOTALL is not set. |
when PCRE_DOTALL is not set. Perl also uses \N to match characters by name; |
| 415 |
|
PCRE does not support this. |
| 416 |
</P> |
</P> |
| 417 |
<P> |
<P> |
| 418 |
Each pair of lower and upper case escape sequences partitions the complete set |
Each pair of lower and upper case escape sequences partitions the complete set |
| 986 |
<P> |
<P> |
| 987 |
The escape sequence \N behaves like a dot, except that it is not affected by |
The escape sequence \N behaves like a dot, except that it is not affected by |
| 988 |
the PCRE_DOTALL option. In other words, it matches any character except one |
the PCRE_DOTALL option. In other words, it matches any character except one |
| 989 |
that signifies the end of a line. |
that signifies the end of a line. Perl also uses \N to match characters by |
| 990 |
|
name; PCRE does not support this. |
| 991 |
</P> |
</P> |
| 992 |
<br><a name="SEC7" href="#TOC1">MATCHING A SINGLE BYTE</a><br> |
<br><a name="SEC7" href="#TOC1">MATCHING A SINGLE BYTE</a><br> |
| 993 |
<P> |
<P> |
| 1003 |
</P> |
</P> |
| 1004 |
<P> |
<P> |
| 1005 |
PCRE does not allow \C to appear in lookbehind assertions |
PCRE does not allow \C to appear in lookbehind assertions |
| 1006 |
<a href="#lookbehind">(described below),</a> |
<a href="#lookbehind">(described below)</a> |
| 1007 |
because in UTF-8 mode this would make it impossible to calculate the length of |
in UTF-8 mode, because this would make it impossible to calculate the length of |
| 1008 |
the lookbehind. |
the lookbehind. |
| 1009 |
</P> |
</P> |
| 1010 |
<P> |
<P> |
| 1950 |
assertion fails. |
assertion fails. |
| 1951 |
</P> |
</P> |
| 1952 |
<P> |
<P> |
| 1953 |
PCRE does not allow the \C escape (which matches a single byte in UTF-8 mode) |
In UTF-8 mode, PCRE does not allow the \C escape (which matches a single byte, |
| 1954 |
to appear in lookbehind assertions, because it makes it impossible to calculate |
even in UTF-8 mode) to appear in lookbehind assertions, because it makes it |
| 1955 |
the length of the lookbehind. The \X and \R escapes, which can match |
impossible to calculate the length of the lookbehind. The \X and \R escapes, |
| 1956 |
different numbers of bytes, are also not permitted. |
which can match different numbers of bytes, are also not permitted. |
| 1957 |
</P> |
</P> |
| 1958 |
<P> |
<P> |
| 1959 |
<a href="#subpatternsassubroutines">"Subroutine"</a> |
<a href="#subpatternsassubroutines">"Subroutine"</a> |
| 2535 |
If any of these verbs are used in an assertion or in a subpattern that is |
If any of these verbs are used in an assertion or in a subpattern that is |
| 2536 |
called as a subroutine (whether or not recursively), their effect is confined |
called as a subroutine (whether or not recursively), their effect is confined |
| 2537 |
to that subpattern; it does not extend to the surrounding pattern, with one |
to that subpattern; it does not extend to the surrounding pattern, with one |
| 2538 |
exception: a *MARK that is encountered in a positive assertion <i>is</i> passed |
exception: the name from a (*MARK), (*PRUNE), or (*THEN) that is encountered in |
| 2539 |
back (compare capturing parentheses in assertions). Note that such subpatterns |
a successful positive assertion <i>is</i> passed back when a match succeeds |
| 2540 |
are processed as anchored at the point where they are tested. Note also that |
(compare capturing parentheses in assertions). Note that such subpatterns are |
| 2541 |
Perl's treatment of subroutines is different in some cases. |
processed as anchored at the point where they are tested. Note also that Perl's |
| 2542 |
|
treatment of subroutines is different in some cases. |
| 2543 |
</P> |
</P> |
| 2544 |
<P> |
<P> |
| 2545 |
The new verbs make use of what was previously invalid syntax: an opening |
The new verbs make use of what was previously invalid syntax: an opening |
| 2561 |
when calling <b>pcre_compile()</b> or <b>pcre_exec()</b>, or by starting the |
when calling <b>pcre_compile()</b> or <b>pcre_exec()</b>, or by starting the |
| 2562 |
pattern with (*NO_START_OPT). |
pattern with (*NO_START_OPT). |
| 2563 |
</P> |
</P> |
| 2564 |
|
<P> |
| 2565 |
|
Experiments with Perl suggest that it too has similar optimizations, sometimes |
| 2566 |
|
leading to anomalous results. |
| 2567 |
|
</P> |
| 2568 |
<br><b> |
<br><b> |
| 2569 |
Verbs that act immediately |
Verbs that act immediately |
| 2570 |
</b><br> |
</b><br> |
| 2612 |
(*MARK) as you like in a pattern, and their names do not have to be unique. |
(*MARK) as you like in a pattern, and their names do not have to be unique. |
| 2613 |
</P> |
</P> |
| 2614 |
<P> |
<P> |
| 2615 |
When a match succeeds, the name of the last-encountered (*MARK) is passed back |
When a match succeeds, the name of the last-encountered (*MARK) on the matching |
| 2616 |
to the caller via the <i>pcre_extra</i> data structure, as described in the |
path is passed back to the caller via the <i>pcre_extra</i> data structure, as |
| 2617 |
|
described in the |
| 2618 |
<a href="pcreapi.html#extradata">section on <i>pcre_extra</i></a> |
<a href="pcreapi.html#extradata">section on <i>pcre_extra</i></a> |
| 2619 |
in the |
in the |
| 2620 |
<a href="pcreapi.html"><b>pcreapi</b></a> |
<a href="pcreapi.html"><b>pcreapi</b></a> |
| 2621 |
documentation. No data is returned for a partial match. Here is an example of |
documentation. Here is an example of <b>pcretest</b> output, where the /K |
| 2622 |
<b>pcretest</b> output, where the /K modifier requests the retrieval and |
modifier requests the retrieval and outputting of (*MARK) data: |
|
outputting of (*MARK) data: |
|
| 2623 |
<pre> |
<pre> |
| 2624 |
/X(*MARK:A)Y|X(*MARK:B)Z/K |
re> /X(*MARK:A)Y|X(*MARK:B)Z/K |
| 2625 |
XY |
data> XY |
| 2626 |
0: XY |
0: XY |
| 2627 |
MK: A |
MK: A |
| 2628 |
XZ |
XZ |
| 2640 |
assertions. |
assertions. |
| 2641 |
</P> |
</P> |
| 2642 |
<P> |
<P> |
| 2643 |
A name may also be returned after a failed match if the final path through the |
After a partial match or a failed match, the name of the last encountered |
| 2644 |
pattern involves (*MARK). However, unless (*MARK) is used in conjunction with |
(*MARK) in the entire match process is returned. For example: |
|
(*COMMIT), this is unlikely to happen for an unanchored pattern because, as the |
|
|
starting point for matching is advanced, the final check is often with an empty |
|
|
string, causing a failure before (*MARK) is reached. For example: |
|
| 2645 |
<pre> |
<pre> |
| 2646 |
/X(*MARK:A)Y|X(*MARK:B)Z/K |
re> /X(*MARK:A)Y|X(*MARK:B)Z/K |
| 2647 |
XP |
data> XP |
|
No match |
|
|
</pre> |
|
|
There are three potential starting points for this match (starting with X, |
|
|
starting with P, and with an empty string). If the pattern is anchored, the |
|
|
result is different: |
|
|
<pre> |
|
|
/^X(*MARK:A)Y|^X(*MARK:B)Z/K |
|
|
XP |
|
| 2648 |
No match, mark = B |
No match, mark = B |
| 2649 |
</pre> |
</pre> |
| 2650 |
PCRE's start-of-match optimizations can also interfere with this. For example, |
Note that in this unanchored example the mark is retained from the match |
| 2651 |
if, as a result of a call to <b>pcre_study()</b>, it knows the minimum |
attempt that started at the letter "X". Subsequent match attempts starting at |
| 2652 |
subject length for a match, a shorter subject will not be scanned at all. |
"P" and then with an empty string do not get as far as the (*MARK) item, but |
| 2653 |
</P> |
nevertheless do not reset it. |
|
<P> |
|
|
Note that similar anomalies (though different in detail) exist in Perl, no |
|
|
doubt for the same reasons. The use of (*MARK) data after a failed match of an |
|
|
unanchored pattern is not recommended, unless (*COMMIT) is involved. |
|
| 2654 |
</P> |
</P> |
| 2655 |
<br><b> |
<br><b> |
| 2656 |
Verbs that act after backtracking |
Verbs that act after backtracking |
| 2689 |
unless PCRE's start-of-match optimizations are turned off, as shown in this |
unless PCRE's start-of-match optimizations are turned off, as shown in this |
| 2690 |
<b>pcretest</b> example: |
<b>pcretest</b> example: |
| 2691 |
<pre> |
<pre> |
| 2692 |
/(*COMMIT)abc/ |
re> /(*COMMIT)abc/ |
| 2693 |
xyzabc |
data> xyzabc |
| 2694 |
0: abc |
0: abc |
| 2695 |
xyzabc\Y |
xyzabc\Y |
| 2696 |
No match |
No match |
| 2711 |
the right, backtracking cannot cross (*PRUNE). In simple cases, the use of |
the right, backtracking cannot cross (*PRUNE). In simple cases, the use of |
| 2712 |
(*PRUNE) is just an alternative to an atomic group or possessive quantifier, |
(*PRUNE) is just an alternative to an atomic group or possessive quantifier, |
| 2713 |
but there are some uses of (*PRUNE) that cannot be expressed in any other way. |
but there are some uses of (*PRUNE) that cannot be expressed in any other way. |
| 2714 |
The behaviour of (*PRUNE:NAME) is the same as (*MARK:NAME)(*PRUNE) when the |
The behaviour of (*PRUNE:NAME) is the same as (*MARK:NAME)(*PRUNE). In an |
| 2715 |
match fails completely; the name is passed back if this is the final attempt. |
anchored pattern (*PRUNE) has the same effect as (*COMMIT). |
|
(*PRUNE:NAME) does not pass back a name if the match succeeds. In an anchored |
|
|
pattern (*PRUNE) has the same effect as (*COMMIT). |
|
| 2716 |
<pre> |
<pre> |
| 2717 |
(*SKIP) |
(*SKIP) |
| 2718 |
</pre> |
</pre> |
| 2738 |
searched for the most recent (*MARK) that has the same name. If one is found, |
searched for the most recent (*MARK) that has the same name. If one is found, |
| 2739 |
the "bumpalong" advance is to the subject position that corresponds to that |
the "bumpalong" advance is to the subject position that corresponds to that |
| 2740 |
(*MARK) instead of to where (*SKIP) was encountered. If no (*MARK) with a |
(*MARK) instead of to where (*SKIP) was encountered. If no (*MARK) with a |
| 2741 |
matching name is found, normal "bumpalong" of one character happens (that is, |
matching name is found, the (*SKIP) is ignored. |
|
the (*SKIP) is ignored). |
|
| 2742 |
<pre> |
<pre> |
| 2743 |
(*THEN) or (*THEN:NAME) |
(*THEN) or (*THEN:NAME) |
| 2744 |
</pre> |
</pre> |
| 2752 |
If the COND1 pattern matches, FOO is tried (and possibly further items after |
If the COND1 pattern matches, FOO is tried (and possibly further items after |
| 2753 |
the end of the group if FOO succeeds); on failure, the matcher skips to the |
the end of the group if FOO succeeds); on failure, the matcher skips to the |
| 2754 |
second alternative and tries COND2, without backtracking into COND1. The |
second alternative and tries COND2, without backtracking into COND1. The |
| 2755 |
behaviour of (*THEN:NAME) is exactly the same as (*MARK:NAME)(*THEN) if the |
behaviour of (*THEN:NAME) is exactly the same as (*MARK:NAME)(*THEN). |
| 2756 |
overall match fails. If (*THEN) is not inside an alternation, it acts like |
If (*THEN) is not inside an alternation, it acts like (*PRUNE). |
|
(*PRUNE). |
|
| 2757 |
</P> |
</P> |
| 2758 |
<P> |
<P> |
| 2759 |
Note that a subpattern that does not contain a | character is just a part of |
Note that a subpattern that does not contain a | character is just a part of |
| 2829 |
</P> |
</P> |
| 2830 |
<br><a name="SEC28" href="#TOC1">REVISION</a><br> |
<br><a name="SEC28" href="#TOC1">REVISION</a><br> |
| 2831 |
<P> |
<P> |
| 2832 |
Last updated: 19 October 2011 |
Last updated: 29 November 2011 |
| 2833 |
<br> |
<br> |
| 2834 |
Copyright © 1997-2011 University of Cambridge. |
Copyright © 1997-2011 University of Cambridge. |
| 2835 |
<br> |
<br> |