| 44 |
.B int *\fIovector\fR, int \fIstringcount\fR, "const char ***\fIlistptr\fR);" |
.B int *\fIovector\fR, int \fIstringcount\fR, "const char ***\fIlistptr\fR);" |
| 45 |
.PP |
.PP |
| 46 |
.br |
.br |
| 47 |
|
.B void pcre_free_substring(const char *\fIstringptr\fR); |
| 48 |
|
.PP |
| 49 |
|
.br |
| 50 |
|
.B void pcre_free_substring_list(const char **\fIstringptr\fR); |
| 51 |
|
.PP |
| 52 |
|
.br |
| 53 |
.B const unsigned char *pcre_maketables(void); |
.B const unsigned char *pcre_maketables(void); |
| 54 |
.PP |
.PP |
| 55 |
.br |
.br |
| 76 |
The PCRE library is a set of functions that implement regular expression |
The PCRE library is a set of functions that implement regular expression |
| 77 |
pattern matching using the same syntax and semantics as Perl 5, with just a few |
pattern matching using the same syntax and semantics as Perl 5, with just a few |
| 78 |
differences (see below). The current implementation corresponds to Perl 5.005, |
differences (see below). The current implementation corresponds to Perl 5.005, |
| 79 |
with some additional features from the Perl development release. |
with some additional features from later versions. This includes some |
| 80 |
|
experimental, incomplete support for UTF-8 encoded strings. Details of exactly |
| 81 |
|
what is and what is not supported are given below. |
| 82 |
|
|
| 83 |
PCRE has its own native API, which is described in this document. There is also |
PCRE has its own native API, which is described in this document. There is also |
| 84 |
a set of wrapper functions that correspond to the POSIX regular expression API. |
a set of wrapper functions that correspond to the POSIX regular expression API. |
| 92 |
use these to include support for different releases. |
use these to include support for different releases. |
| 93 |
|
|
| 94 |
The functions \fBpcre_compile()\fR, \fBpcre_study()\fR, and \fBpcre_exec()\fR |
The functions \fBpcre_compile()\fR, \fBpcre_study()\fR, and \fBpcre_exec()\fR |
| 95 |
are used for compiling and matching regular expressions, while |
are used for compiling and matching regular expressions. |
| 96 |
\fBpcre_copy_substring()\fR, \fBpcre_get_substring()\fR, and |
|
| 97 |
|
The functions \fBpcre_copy_substring()\fR, \fBpcre_get_substring()\fR, and |
| 98 |
\fBpcre_get_substring_list()\fR are convenience functions for extracting |
\fBpcre_get_substring_list()\fR are convenience functions for extracting |
| 99 |
captured substrings from a matched subject string. The function |
captured substrings from a matched subject string; \fBpcre_free_substring()\fR |
| 100 |
\fBpcre_maketables()\fR is used (optionally) to build a set of character tables |
and \fBpcre_free_substring_list()\fR are also provided, to free the memory used |
| 101 |
in the current locale for passing to \fBpcre_compile()\fR. |
for extracted strings. |
| 102 |
|
|
| 103 |
|
The function \fBpcre_maketables()\fR is used (optionally) to build a set of |
| 104 |
|
character tables in the current locale for passing to \fBpcre_compile()\fR. |
| 105 |
|
|
| 106 |
The function \fBpcre_fullinfo()\fR is used to find out information about a |
The function \fBpcre_fullinfo()\fR is used to find out information about a |
| 107 |
compiled pattern; \fBpcre_info()\fR is an obsolete version which returns only |
compiled pattern; \fBpcre_info()\fR is an obsolete version which returns only |
| 235 |
greedy by default, but become greedy if followed by "?". It is not compatible |
greedy by default, but become greedy if followed by "?". It is not compatible |
| 236 |
with Perl. It can also be set by a (?U) option setting within the pattern. |
with Perl. It can also be set by a (?U) option setting within the pattern. |
| 237 |
|
|
| 238 |
|
PCRE_UTF8 |
| 239 |
|
|
| 240 |
|
This option causes PCRE to regard both the pattern and the subject as strings |
| 241 |
|
of UTF-8 characters instead of just byte strings. However, it is available only |
| 242 |
|
if PCRE has been built to include UTF-8 support. If not, the use of this option |
| 243 |
|
provokes an error. Support for UTF-8 is new, experimental, and incomplete. |
| 244 |
|
Details of exactly what it entails are given below. |
| 245 |
|
|
| 246 |
|
|
| 247 |
.SH STUDYING A PATTERN |
.SH STUDYING A PATTERN |
| 248 |
When a pattern is going to be used several times, it is worth spending more |
When a pattern is going to be used several times, it is worth spending more |
| 343 |
|
|
| 344 |
Return information about the first character of any matched string, for a |
Return information about the first character of any matched string, for a |
| 345 |
non-anchored pattern. If there is a fixed first character, e.g. from a pattern |
non-anchored pattern. If there is a fixed first character, e.g. from a pattern |
| 346 |
such as (cat|cow|coyote), then it is returned in the integer pointed to by |
such as (cat|cow|coyote), it is returned in the integer pointed to by |
| 347 |
\fIwhere\fR. Otherwise, if either |
\fIwhere\fR. Otherwise, if either |
| 348 |
|
|
| 349 |
(a) the pattern was compiled with the PCRE_MULTILINE option, and every branch |
(a) the pattern was compiled with the PCRE_MULTILINE option, and every branch |
| 352 |
(b) every branch of the pattern starts with ".*" and PCRE_DOTALL is not set |
(b) every branch of the pattern starts with ".*" and PCRE_DOTALL is not set |
| 353 |
(if it were set, the pattern would be anchored), |
(if it were set, the pattern would be anchored), |
| 354 |
|
|
| 355 |
then -1 is returned, indicating that the pattern matches only at the |
-1 is returned, indicating that the pattern matches only at the start of a |
| 356 |
start of a subject string or after any "\\n" within the string. Otherwise -2 is |
subject string or after any "\\n" within the string. Otherwise -2 is returned. |
| 357 |
returned. For anchored patterns, -2 is returned. |
For anchored patterns, -2 is returned. |
| 358 |
|
|
| 359 |
PCRE_INFO_FIRSTTABLE |
PCRE_INFO_FIRSTTABLE |
| 360 |
|
|
| 570 |
were captured by the match, including the substring that matched the entire |
were captured by the match, including the substring that matched the entire |
| 571 |
regular expression. This is the value returned by \fBpcre_exec\fR if it |
regular expression. This is the value returned by \fBpcre_exec\fR if it |
| 572 |
is greater than zero. If \fBpcre_exec()\fR returned zero, indicating that it |
is greater than zero. If \fBpcre_exec()\fR returned zero, indicating that it |
| 573 |
ran out of space in \fIovector\fR, then the value passed as |
ran out of space in \fIovector\fR, the value passed as \fIstringcount\fR should |
| 574 |
\fIstringcount\fR should be the size of the vector divided by three. |
be the size of the vector divided by three. |
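The rule above for choosing \fIstringcount\fR can be sketched as a small helper. The name substring_count and its parameters are hypothetical, not part of the PCRE API; \fIrc\fR is assumed to be the value returned by \fBpcre_exec()\fR and \fIovecsize\fR the number of ints in \fIovector\fR.

```c
/* Hypothetical helper, not part of PCRE: derive the stringcount argument
 * for the substring functions from the pcre_exec() return value.
 *   rc > 0  : rc itself is the number of captured substrings
 *   rc == 0 : ovector was too small, so use ovecsize / 3
 *   rc < 0  : no match or an error, so there is nothing to extract (-1)
 */
int substring_count(int rc, int ovecsize)
{
    if (rc > 0)
        return rc;
    if (rc == 0)
        return ovecsize / 3;
    return -1;
}
```

A caller would typically check for a negative result before passing the count on to \fBpcre_copy_substring()\fR or \fBpcre_get_substring()\fR.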
| 575 |
|
|
| 576 |
The functions \fBpcre_copy_substring()\fR and \fBpcre_get_substring()\fR |
The functions \fBpcre_copy_substring()\fR and \fBpcre_get_substring()\fR |
| 577 |
extract a single substring, whose number is given as \fIstringnumber\fR. A |
extract a single substring, whose number is given as \fIstringnumber\fR. A |
| 578 |
value of zero extracts the substring that matched the entire pattern, while |
value of zero extracts the substring that matched the entire pattern, while |
| 579 |
higher values extract the captured substrings. For \fBpcre_copy_substring()\fR, |
higher values extract the captured substrings. For \fBpcre_copy_substring()\fR, |
| 580 |
the string is placed in \fIbuffer\fR, whose length is given by |
the string is placed in \fIbuffer\fR, whose length is given by |
| 581 |
\fIbuffersize\fR, while for \fBpcre_get_substring()\fR a new block of store is |
\fIbuffersize\fR, while for \fBpcre_get_substring()\fR a new block of memory is |
| 582 |
obtained via \fBpcre_malloc\fR, and its address is returned via |
obtained via \fBpcre_malloc\fR, and its address is returned via |
| 583 |
\fIstringptr\fR. The yield of the function is the length of the string, not |
\fIstringptr\fR. The yield of the function is the length of the string, not |
| 584 |
including the terminating zero, or one of |
including the terminating zero, or one of |
| 610 |
inspecting the appropriate offset in \fIovector\fR, which is negative for unset |
inspecting the appropriate offset in \fIovector\fR, which is negative for unset |
| 611 |
substrings. |
substrings. |
| 612 |
|
|
| 613 |
|
The two convenience functions \fBpcre_free_substring()\fR and |
| 614 |
|
\fBpcre_free_substring_list()\fR can be used to free the memory returned by |
| 615 |
|
a previous call of \fBpcre_get_substring()\fR or |
| 616 |
|
\fBpcre_get_substring_list()\fR, respectively. They do nothing more than call |
| 617 |
|
the function pointed to by \fBpcre_free\fR, which of course could be called |
| 618 |
|
directly from a C program. However, PCRE is used in some situations where it is |
| 619 |
|
linked via a special interface to another programming language which cannot use |
| 620 |
|
\fBpcre_free\fR directly; it is for these cases that the functions are |
| 621 |
|
provided. |
| 622 |
|
|
| 623 |
|
|
| 624 |
.SH LIMITATIONS |
.SH LIMITATIONS |
| 679 |
with the settings of captured strings when part of a pattern is repeated. For |
with the settings of captured strings when part of a pattern is repeated. For |
| 680 |
example, matching "aba" against the pattern /^(a(b)?)+$/ sets $2 to the value |
example, matching "aba" against the pattern /^(a(b)?)+$/ sets $2 to the value |
| 681 |
"b", but matching "aabbaa" against /^(aa(bb)?)+$/ leaves $2 unset. However, if |
"b", but matching "aabbaa" against /^(aa(bb)?)+$/ leaves $2 unset. However, if |
| 682 |
the pattern is changed to /^(aa(b(b))?)+$/ then $2 (and $3) get set. |
the pattern is changed to /^(aa(b(b))?)+$/ then $2 (and $3) are set. |
| 683 |
|
|
| 684 |
In Perl 5.004 $2 is set in both cases, and that is also true of PCRE. If in the |
In Perl 5.004 $2 is set in both cases, and that is also true of PCRE. If in the |
| 685 |
future Perl changes to a consistent state that is different, PCRE may change to |
future Perl changes to a consistent state that is different, PCRE may change to |
| 720 |
described below. Regular expressions are also described in the Perl |
described below. Regular expressions are also described in the Perl |
| 721 |
documentation and in a number of other books, some of which have copious |
documentation and in a number of other books, some of which have copious |
| 722 |
examples. Jeffrey Friedl's "Mastering Regular Expressions", published by |
examples. Jeffrey Friedl's "Mastering Regular Expressions", published by |
| 723 |
O'Reilly (ISBN 1-56592-257), covers them in great detail. The description |
O'Reilly (ISBN 1-56592-257), covers them in great detail. |
| 724 |
here is intended as reference documentation. |
|
| 725 |
|
The description here is intended as reference documentation. The basic |
| 726 |
|
operation of PCRE is on strings of bytes. However, there are the beginnings of |
| 727 |
|
some support for UTF-8 character strings. To use this support you must |
| 728 |
|
configure PCRE to include it, and then call \fBpcre_compile()\fR with the |
| 729 |
|
PCRE_UTF8 option. How this affects the pattern matching is described in the |
| 730 |
|
final section of this document. |
| 731 |
|
|
| 732 |
A regular expression is a pattern that is matched against a subject string from |
A regular expression is a pattern that is matched against a subject string from |
| 733 |
left to right. Most characters stand for themselves in a pattern, and match the |
left to right. Most characters stand for themselves in a pattern, and match the |
| 955 |
.SH FULL STOP (PERIOD, DOT) |
.SH FULL STOP (PERIOD, DOT) |
| 956 |
Outside a character class, a dot in the pattern matches any one character in |
Outside a character class, a dot in the pattern matches any one character in |
| 957 |
the subject, including a non-printing character, but not (by default) newline. |
the subject, including a non-printing character, but not (by default) newline. |
| 958 |
If the PCRE_DOTALL option is set, then dots match newlines as well. The |
If the PCRE_DOTALL option is set, dots match newlines as well. The handling of |
| 959 |
handling of dot is entirely independent of the handling of circumflex and |
dot is entirely independent of the handling of circumflex and dollar, the only |
| 960 |
dollar, the only relationship being that they both involve newline characters. |
relationship being that they both involve newline characters. Dot has no |
| 961 |
Dot has no special meaning in a character class. |
special meaning in a character class. |
| 962 |
|
|
| 963 |
|
|
| 964 |
.SH SQUARE BRACKETS |
.SH SQUARE BRACKETS |
| 1248 |
fails, because it matches the entire string due to the greediness of the .* |
fails, because it matches the entire string due to the greediness of the .* |
| 1249 |
item. |
item. |
| 1250 |
|
|
| 1251 |
However, if a quantifier is followed by a question mark, then it ceases to be |
However, if a quantifier is followed by a question mark, it ceases to be |
| 1252 |
greedy, and instead matches the minimum number of times possible, so the |
greedy, and instead matches the minimum number of times possible, so the |
| 1253 |
pattern |
pattern |
| 1254 |
|
|
| 1264 |
which matches one digit by preference, but can match two if that is the only |
which matches one digit by preference, but can match two if that is the only |
| 1265 |
way the rest of the pattern matches. |
way the rest of the pattern matches. |
| 1266 |
|
|
| 1267 |
If the PCRE_UNGREEDY option is set (an option which is not available in Perl) |
If the PCRE_UNGREEDY option is set (an option which is not available in Perl), |
| 1268 |
then the quantifiers are not greedy by default, but individual ones can be made |
the quantifiers are not greedy by default, but individual ones can be made |
| 1269 |
greedy by following them with a question mark. In other words, it inverts the |
greedy by following them with a question mark. In other words, it inverts the |
| 1270 |
default behaviour. |
default behaviour. |
| 1271 |
|
|
| 1274 |
compiled pattern, in proportion to the size of the minimum or maximum. |
compiled pattern, in proportion to the size of the minimum or maximum. |
| 1275 |
|
|
| 1276 |
If a pattern starts with .* or .{0,} and the PCRE_DOTALL option (equivalent |
If a pattern starts with .* or .{0,} and the PCRE_DOTALL option (equivalent |
| 1277 |
to Perl's /s) is set, thus allowing the . to match newlines, then the pattern |
to Perl's /s) is set, thus allowing the . to match newlines, the pattern is |
| 1278 |
is implicitly anchored, because whatever follows will be tried against every |
implicitly anchored, because whatever follows will be tried against every |
| 1279 |
character position in the subject string, so there is no point in retrying the |
character position in the subject string, so there is no point in retrying the |
| 1280 |
overall match at any position after the first. PCRE treats such a pattern as |
overall match at any position after the first. PCRE treats such a pattern as |
| 1281 |
though it were preceded by \\A. In cases where it is known that the subject |
though it were preceded by \\A. In cases where it is known that the subject |
| 1319 |
|
|
| 1320 |
matches "sense and sensibility" and "response and responsibility", but not |
matches "sense and sensibility" and "response and responsibility", but not |
| 1321 |
"sense and responsibility". If caseful matching is in force at the time of the |
"sense and responsibility". If caseful matching is in force at the time of the |
| 1322 |
back reference, then the case of letters is relevant. For example, |
back reference, the case of letters is relevant. For example, |
| 1323 |
|
|
| 1324 |
((?i)rah)\\s+\\1 |
((?i)rah)\\s+\\1 |
| 1325 |
|
|
| 1327 |
capturing subpattern is matched caselessly. |
capturing subpattern is matched caselessly. |
| 1328 |
|
|
| 1329 |
There may be more than one back reference to the same subpattern. If a |
There may be more than one back reference to the same subpattern. If a |
| 1330 |
subpattern has not actually been used in a particular match, then any back |
subpattern has not actually been used in a particular match, any back |
| 1331 |
references to it always fail. For example, the pattern |
references to it always fail. For example, the pattern |
| 1332 |
|
|
| 1333 |
(a|(bc))\\2 |
(a|(bc))\\2 |
| 1335 |
always fails if it starts to match "a" rather than "bc". Because there may be |
always fails if it starts to match "a" rather than "bc". Because there may be |
| 1336 |
up to 99 back references, all digits following the backslash are taken |
up to 99 back references, all digits following the backslash are taken |
| 1337 |
as part of a potential back reference number. If the pattern continues with a |
as part of a potential back reference number. If the pattern continues with a |
| 1338 |
digit character, then some delimiter must be used to terminate the back |
digit character, some delimiter must be used to terminate the back reference. |
| 1339 |
reference. If the PCRE_EXTENDED option is set, this can be whitespace. |
If the PCRE_EXTENDED option is set, this can be whitespace. Otherwise an empty |
| 1340 |
Otherwise an empty comment can be used. |
comment can be used. |
| 1341 |
|
|
| 1342 |
A back reference that occurs inside the parentheses to which it refers fails |
A back reference that occurs inside the parentheses to which it refers fails |
| 1343 |
when the subpattern is first used, so, for example, (a\\1) never matches. |
when the subpattern is first used, so, for example, (a\\1) never matches. |
| 1346 |
|
|
| 1347 |
(a|b\\1)+ |
(a|b\\1)+ |
| 1348 |
|
|
| 1349 |
matches any number of "a"s and also "aba", "ababaa" etc. At each iteration of |
matches any number of "a"s and also "aba", "ababbaa" etc. At each iteration of |
| 1350 |
the subpattern, the back reference matches the character string corresponding |
the subpattern, the back reference matches the character string corresponding |
| 1351 |
to the previous iteration. In order for this to work, the pattern must be such |
to the previous iteration. In order for this to work, the pattern must be such |
| 1352 |
that the first iteration does not need to match the back reference. This can be |
that the first iteration does not need to match the back reference. This can be |
| 1425 |
matches "foo" preceded by three digits that are not "999". Notice that each of |
matches "foo" preceded by three digits that are not "999". Notice that each of |
| 1426 |
the assertions is applied independently at the same point in the subject |
the assertions is applied independently at the same point in the subject |
| 1427 |
string. First there is a check that the previous three characters are all |
string. First there is a check that the previous three characters are all |
| 1428 |
digits, then there is a check that the same three characters are not "999". |
digits, and then there is a check that the same three characters are not "999". |
| 1429 |
This pattern does \fInot\fR match "foo" preceded by six characters, the first |
This pattern does \fInot\fR match "foo" preceded by six characters, the first |
| 1430 |
of which are digits and the last three of which are not "999". For example, it |
of which are digits and the last three of which are not "999". For example, it |
| 1431 |
doesn't match "123abcfoo". A pattern to do that is |
doesn't match "123abcfoo". A pattern to do that is |
| 1510 |
|
|
| 1511 |
^.*abcd$ |
^.*abcd$ |
| 1512 |
|
|
| 1513 |
then the initial .* matches the entire string at first, but when this fails |
the initial .* matches the entire string at first, but when this fails (because |
| 1514 |
(because there is no following "a"), it backtracks to match all but the last |
there is no following "a"), it backtracks to match all but the last character, |
| 1515 |
character, then all but the last two characters, and so on. Once again the |
then all but the last two characters, and so on. Once again the search for "a" |
| 1516 |
search for "a" covers the entire string, from right to left, so we are no |
covers the entire string, from right to left, so we are no better off. However, |
| 1517 |
better off. However, if the pattern is written as |
if the pattern is written as |
| 1518 |
|
|
| 1519 |
^(?>.*)(?<=abcd) |
^(?>.*)(?<=abcd) |
| 1520 |
|
|
| 1521 |
then there can be no backtracking for the .* item; it can match only the entire |
there can be no backtracking for the .* item; it can match only the entire |
| 1522 |
string. The subsequent lookbehind assertion does a single test on the last four |
string. The subsequent lookbehind assertion does a single test on the last four |
| 1523 |
characters. If it fails, the match fails immediately. For long strings, this |
characters. If it fails, the match fails immediately. For long strings, this |
| 1524 |
approach makes a significant difference to the processing time. |
approach makes a significant difference to the processing time. |
| 1563 |
subpattern, a compile-time error occurs. |
subpattern, a compile-time error occurs. |
| 1564 |
|
|
| 1565 |
There are two kinds of condition. If the text between the parentheses consists |
There are two kinds of condition. If the text between the parentheses consists |
| 1566 |
of a sequence of digits, then the condition is satisfied if the capturing |
of a sequence of digits, the condition is satisfied if the capturing subpattern |
| 1567 |
subpattern of that number has previously matched. Consider the following |
of that number has previously matched. Consider the following pattern, which |
| 1568 |
pattern, which contains non-significant white space to make it more readable |
contains non-significant white space to make it more readable (assume the |
| 1569 |
(assume the PCRE_EXTENDED option) and to divide it into three parts for ease |
PCRE_EXTENDED option) and to divide it into three parts for ease of discussion: |
|
of discussion: |
|
| 1570 |
|
|
| 1571 |
( \\( )? [^()]+ (?(1) \\) ) |
( \\( )? [^()]+ (?(1) \\) ) |
| 1572 |
|
|
| 1656 |
\\( ( ( (?>[^()]+) | (?R) )* ) \\) |
\\( ( ( (?>[^()]+) | (?R) )* ) \\) |
| 1657 |
^ ^ |
^ ^ |
| 1658 |
^ ^ |
^ ^ |
| 1659 |
then the string they capture is "ab(cd)ef", the contents of the top level |
the string they capture is "ab(cd)ef", the contents of the top level |
| 1660 |
parentheses. If there are more than 15 capturing parentheses in a pattern, PCRE |
parentheses. If there are more than 15 capturing parentheses in a pattern, PCRE |
| 1661 |
has to obtain extra memory to store data during a recursion, which it does by |
has to obtain extra memory to store data during a recursion, which it does by |
| 1662 |
using \fBpcre_malloc\fR, freeing it via \fBpcre_free\fR afterwards. If no |
using \fBpcre_malloc\fR, freeing it via \fBpcre_free\fR afterwards. If no |
| 1720 |
applied to a whole line of "a" characters, whereas the latter takes an |
applied to a whole line of "a" characters, whereas the latter takes an |
| 1721 |
appreciable time with strings longer than about 20 characters. |
appreciable time with strings longer than about 20 characters. |
| 1722 |
|
|
| 1723 |
|
|
| 1724 |
|
.SH UTF-8 SUPPORT |
| 1725 |
|
Starting at release 3.3, PCRE has some support for character strings encoded |
| 1726 |
|
in the UTF-8 format. This is incomplete, and is regarded as experimental. In |
| 1727 |
|
order to use it, you must configure PCRE to include UTF-8 support in the code, |
| 1728 |
|
and, in addition, you must call \fBpcre_compile()\fR with the PCRE_UTF8 option |
| 1729 |
|
flag. When you do this, both the pattern and any subject strings that are |
| 1730 |
|
matched against it are treated as UTF-8 strings instead of just strings of |
| 1731 |
|
bytes, but only in the cases that are mentioned below. |
| 1732 |
|
|
| 1733 |
|
If you compile PCRE with UTF-8 support, but do not use it at run time, the |
| 1734 |
|
library will be a bit bigger, but the additional run time overhead is limited |
| 1735 |
|
to testing the PCRE_UTF8 flag in several places, so should not be very large. |
| 1736 |
|
|
| 1737 |
|
PCRE assumes that the strings it is given contain valid UTF-8 codes. It does |
| 1738 |
|
not diagnose invalid UTF-8 strings. If you pass invalid UTF-8 strings to PCRE, |
| 1739 |
|
the results are undefined. |
| 1740 |
|
|
| 1741 |
|
Running with PCRE_UTF8 set causes these changes in the way PCRE works: |
| 1742 |
|
|
| 1743 |
|
1. In a pattern, the escape sequence \\x{...}, where the contents of the braces |
| 1744 |
|
is a string of hexadecimal digits, is interpreted as a UTF-8 character whose |
| 1745 |
|
code number is the given hexadecimal number, for example: \\x{1234}. This |
| 1746 |
|
inserts from one to six literal bytes into the pattern, using the UTF-8 |
| 1747 |
|
encoding. If a non-hexadecimal digit appears between the braces, the item is |
| 1748 |
|
not recognized. |
| 1749 |
|
|
| 1750 |
|
2. The original hexadecimal escape sequence, \\xhh, generates a two-byte UTF-8 |
| 1751 |
|
character if its value is greater than 127. |
| 1752 |
|
|
| 1753 |
|
3. Repeat quantifiers are NOT correctly handled if they follow a multibyte |
| 1754 |
|
character. For example, \\x{100}* and \\xc3+ do not work. If you want to |
| 1755 |
|
repeat such characters, you must enclose them in non-capturing parentheses, |
| 1756 |
|
for example (?:\\x{100}), at present. |
| 1757 |
|
|
| 1758 |
|
4. The dot metacharacter matches one UTF-8 character instead of a single byte. |
| 1759 |
|
|
| 1760 |
|
5. Unlike literal UTF-8 characters, the dot metacharacter followed by a |
| 1761 |
|
repeat quantifier does operate correctly on UTF-8 characters instead of |
| 1762 |
|
single bytes. |
| 1763 |
|
|
| 1764 |
|
6. Although the \\x{...} escape is permitted in a character class, characters |
| 1765 |
|
whose values are greater than 255 cannot be included in a class. |
| 1766 |
|
|
| 1767 |
|
7. A class is matched against a UTF-8 character instead of just a single byte, |
| 1768 |
|
but it can match only characters whose values are less than 256. Characters |
| 1769 |
|
with greater values always fail to match a class. |
| 1770 |
|
|
| 1771 |
|
8. Repeated classes work correctly on multiple characters. |
| 1772 |
|
|
| 1773 |
|
9. Classes containing just a single character whose value is greater than 127 |
| 1774 |
|
(but less than 256), for example, [\\x80] or [^\\x{93}], do not work because |
| 1775 |
|
these are optimized into single byte matches. In the first case, of course, |
| 1776 |
|
the class brackets are just redundant. |
| 1777 |
|
|
| 1778 |
|
10. Lookbehind assertions move backwards in the subject by a fixed number of |
| 1779 |
|
characters instead of a fixed number of bytes. Simple cases have been tested |
| 1780 |
|
to work correctly, but there may be hidden gotchas herein. |
| 1781 |
|
|
| 1782 |
|
11. The character types such as \\d and \\w do not work correctly with UTF-8 |
| 1783 |
|
characters. They continue to test a single byte. |
| 1784 |
|
|
| 1785 |
|
12. Anything not explicitly mentioned here continues to work in bytes rather |
| 1786 |
|
than in characters. |
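To make the "one to six literal bytes" in the first item of the list above concrete, here is a minimal sketch of the original UTF-8 encoding scheme. The function utf8_encode is a hypothetical illustration written for this note; it is not taken from the PCRE source and is not part of its API.

```c
#include <stddef.h>

/* Hypothetical helper, not part of PCRE: encode code point cp into out
 * (which must have room for 6 bytes) using the original 1-to-6 byte UTF-8
 * scheme. Returns the number of bytes written, or 0 if cp is above
 * 0x7fffffff and therefore cannot be encoded.
 */
size_t utf8_encode(unsigned long cp, unsigned char *out)
{
    /* Largest code point representable with 1, 2, ... 6 bytes. */
    static const unsigned long limit[6] = {
        0x7fUL, 0x7ffUL, 0xffffUL, 0x1fffffUL, 0x3ffffffUL, 0x7fffffffUL
    };
    /* Marker bits of the first byte for each encoded length. */
    static const unsigned char lead[6] = {
        0x00, 0xc0, 0xe0, 0xf0, 0xf8, 0xfc
    };
    size_t n, i;

    /* n becomes the number of continuation bytes (0 to 5). */
    for (n = 0; n < 6; n++)
        if (cp <= limit[n])
            break;
    if (n == 6)
        return 0;

    /* Fill continuation bytes from the end, 6 bits at a time. */
    for (i = n; i > 0; i--) {
        out[i] = (unsigned char)(0x80 | (cp & 0x3f));
        cp >>= 6;
    }
    out[0] = (unsigned char)(lead[n] | cp);
    return n + 1;
}
```

Under this scheme \\x{1234} corresponds to the three bytes E1 88 B4, and any value from 128 to 255 (as produced by \\xhh in the second item) needs two bytes, which is why a following quantifier would otherwise apply to only the last byte.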
| 1787 |
|
|
| 1788 |
|
The following UTF-8 features of Perl 5.6 are not implemented: |
| 1789 |
|
|
| 1790 |
|
1. The escape sequence \\C to match a single byte. |
| 1791 |
|
|
| 1792 |
|
2. The use of Unicode tables and properties and escapes \\p, \\P, and \\X. |
| 1793 |
|
|
| 1794 |
.SH AUTHOR |
.SH AUTHOR |
| 1795 |
Philip Hazel <ph10@cam.ac.uk> |
Philip Hazel <ph10@cam.ac.uk> |
| 1796 |
.br |
.br |
| 1802 |
.br |
.br |
| 1803 |
Phone: +44 1223 334714 |
Phone: +44 1223 334714 |
| 1804 |
|
|
| 1805 |
Last updated: 27 January 2000 |
Last updated: 28 August 2000, |
| 1806 |
|
.br |
| 1807 |
|
the 250th anniversary of the death of J.S. Bach. |
| 1808 |
.br |
.br |
| 1809 |
Copyright (c) 1997-2000 University of Cambridge. |
Copyright (c) 1997-2000 University of Cambridge. |