| 118 |
The following comments apply when PCRE is running in UTF-8 |
The following comments apply when PCRE is running in UTF-8 |
| 119 |
mode: |
mode: |
| 120 |
|
|
| 121 |
1. PCRE assumes that the strings it is given contain valid |
1. When you set the PCRE_UTF8 flag, the strings passed as |
| 122 |
UTF-8 codes. It does not diagnose invalid UTF-8 strings. If |
patterns and subjects are checked for validity on entry to |
| 123 |
you pass invalid UTF-8 strings to PCRE, the results are |
the relevant functions. If an invalid UTF-8 string is |
| 124 |
undefined. |
passed, an error return is given. In some situations, you |
| 125 |
|
may already know that your strings are valid, and therefore |
| 126 |
|
want to skip these checks in order to improve performance. |
| 127 |
|
If you set the PCRE_NO_UTF8_CHECK flag at compile time or at |
| 128 |
|
run time, PCRE assumes that the pattern or subject it is |
| 129 |
|
given (respectively) contains only valid UTF-8 codes. In |
| 130 |
|
this case, it does not diagnose an invalid UTF-8 string. If |
| 131 |
|
you pass an invalid UTF-8 string to PCRE when |
| 132 |
|
PCRE_NO_UTF8_CHECK is set, the results are undefined. Your |
| 133 |
|
program may crash. |
| 134 |
|
|
| 135 |
2. In a pattern, the escape sequence \x{...}, where the con- |
2. In a pattern, the escape sequence \x{...}, where the con- |
| 136 |
tents of the braces is a string of hexadecimal digits, is |
tents of the braces is a string of hexadecimal digits, is |
| 173 |
Cambridge CB2 3QG, England. |
Cambridge CB2 3QG, England. |
| 174 |
Phone: +44 1223 334714 |
Phone: +44 1223 334714 |
| 175 |
|
|
| 176 |
Last updated: 04 February 2003 |
Last updated: 20 August 2003 |
| 177 |
Copyright (c) 1997-2003 University of Cambridge. |
Copyright (c) 1997-2003 University of Cambridge. |
| 178 |
----------------------------------------------------------------------------- |
----------------------------------------------------------------------------- |
| 179 |
|
|
| 663 |
option changes the behaviour of PCRE are given in the sec- |
option changes the behaviour of PCRE are given in the sec- |
| 664 |
tion on UTF-8 support in the main pcre page. |
tion on UTF-8 support in the main pcre page. |
| 665 |
|
|
| 666 |
|
PCRE_NO_UTF8_CHECK |
| 667 |
|
|
| 668 |
|
When PCRE_UTF8 is set, the validity of the pattern as a |
| 669 |
|
UTF-8 string is automatically checked. If an invalid UTF-8 |
| 670 |
|
sequence of bytes is found, pcre_compile() returns an error. |
| 671 |
|
If you already know that your pattern is valid, and you want |
| 672 |
|
to skip this check for performance reasons, you can set the |
| 673 |
|
PCRE_NO_UTF8_CHECK option. When it is set, the effect of |
| 674 |
|
passing an invalid UTF-8 string as a pattern is undefined. |
| 675 |
|
It may cause your program to crash. Note that there is a |
| 676 |
|
similar option for suppressing the checking of subject |
| 677 |
|
strings passed to pcre_exec(). |
| 678 |
|
|
| 679 |
|
|
| 680 |
|
|
| 681 |
STUDYING A PATTERN |
STUDYING A PATTERN |
| 682 |
|
|
| 770 |
compiled pattern. It replaces the obsolete pcre_info() func- |
compiled pattern. It replaces the obsolete pcre_info() func- |
| 771 |
tion, which is nevertheless retained for backwards compatibil- |
tion, which is nevertheless retained for backwards compatibil- |
| 772 |
ity (and is documented below). |
ity (and is documented below). |
|
|
|
| 773 |
The first argument for pcre_fullinfo() is a pointer to the |
The first argument for pcre_fullinfo() is a pointer to the |
| 774 |
compiled pattern. The second argument is the result of |
compiled pattern. The second argument is the result of |
| 775 |
pcre_study(), or NULL if the pattern was not studied. The |
pcre_study(), or NULL if the pattern was not studied. The |
| 1036 |
turned out to be anchored by virtue of its contents, it can- |
turned out to be anchored by virtue of its contents, it can- |
| 1037 |
not be made unanchored at matching time. |
not be made unanchored at matching time. |
| 1038 |
|
|
| 1039 |
|
When PCRE_UTF8 was set at compile time, the validity of the |
| 1040 |
|
subject as a UTF-8 string is automatically checked. If an |
| 1041 |
|
invalid UTF-8 sequence of bytes is found, pcre_exec() |
| 1042 |
|
returns the error PCRE_ERROR_BADUTF8. If you already know |
| 1043 |
|
that your subject is valid, and you want to skip this check |
| 1044 |
|
for performance reasons, you can set the PCRE_NO_UTF8_CHECK |
| 1045 |
|
option when calling pcre_exec(). When this option is set, |
| 1046 |
|
the effect of passing an invalid UTF-8 string as a subject |
| 1047 |
|
is undefined. It may cause your program to crash. |
| 1048 |
|
|
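The text above says PCRE_NO_UTF8_CHECK is only safe when the caller already knows the subject is valid UTF-8. One way to gain that knowledge is to validate the string once, up front, and then reuse it across many pcre_exec() calls. The following is a minimal sketch of such a check, not PCRE's internal validation. The helper name is_valid_utf8 is hypothetical, and the rules it applies (no overlong forms, no surrogates, nothing above U+10FFFF) are an assumption taken from RFC 3629, which may be stricter than what PCRE itself accepts.

```c
#include <stddef.h>

/* Hypothetical helper, not part of the PCRE API.
   Returns 1 if the first len bytes of s are well-formed UTF-8
   in the RFC 3629 sense (an assumption of this sketch), else 0. */
int is_valid_utf8(const unsigned char *s, size_t len)
{
    size_t i = 0;

    while (i < len) {
        unsigned char c = s[i];
        size_t extra;        /* continuation bytes expected after c */
        size_t k;
        unsigned long cp;    /* code point being decoded */
        unsigned long min;   /* smallest code point legal at this length */

        if (c < 0x80) {      /* plain ASCII byte */
            i++;
            continue;
        } else if ((c & 0xE0) == 0xC0) {
            extra = 1; cp = c & 0x1F; min = 0x80;
        } else if ((c & 0xF0) == 0xE0) {
            extra = 2; cp = c & 0x0F; min = 0x800;
        } else if ((c & 0xF8) == 0xF0) {
            extra = 3; cp = c & 0x07; min = 0x10000;
        } else {
            return 0;        /* stray continuation or invalid lead byte */
        }

        if (extra > len - i - 1)
            return 0;        /* sequence is truncated */

        for (k = 1; k <= extra; k++) {
            unsigned char cc = s[i + k];
            if ((cc & 0xC0) != 0x80)
                return 0;    /* not a continuation byte */
            cp = (cp << 6) | (cc & 0x3F);
        }

        if (cp < min)
            return 0;        /* overlong encoding */
        if (cp > 0x10FFFF)
            return 0;        /* beyond the Unicode range */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;        /* UTF-16 surrogate */

        i += extra + 1;
    }
    return 1;
}
```

If this returns 1 for a subject, a caller could reasonably pass PCRE_NO_UTF8_CHECK in the options argument of later pcre_exec() calls on that same, unmodified string. If it returns 0, leaving the option unset lets pcre_exec() report PCRE_ERROR_BADUTF8 rather than running into undefined behaviour.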
| 1049 |
There are also three further options that can be set only at |
There are also three further options that can be set only at |
| 1050 |
matching time: |
matching time: |
| 1051 |
|
|
| 1135 |
used for a fragment of a pattern that picks out a substring. |
used for a fragment of a pattern that picks out a substring. |
| 1136 |
PCRE supports several other kinds of parenthesized subpat- |
PCRE supports several other kinds of parenthesized subpat- |
| 1137 |
tern that do not cause substrings to be captured. |
tern that do not cause substrings to be captured. |
|
|
|
| 1138 |
Captured substrings are returned to the caller via a vector |
Captured substrings are returned to the caller via a vector |
| 1139 |
of integer offsets whose address is passed in ovector. The |
of integer offsets whose address is passed in ovector. The |
| 1140 |
number of elements in the vector is passed in ovecsize. The |
number of elements in the vector is passed in ovecsize. The |
| 1250 |
distinctive error code. See the pcrecallout documentation |
distinctive error code. See the pcrecallout documentation |
| 1251 |
for details. |
for details. |
| 1252 |
|
|
| 1253 |
|
PCRE_ERROR_BADUTF8 (-10) |
| 1254 |
|
|
| 1255 |
|
A string that contains an invalid UTF-8 byte sequence was |
| 1256 |
|
passed as a subject. |
| 1257 |
|
|
| 1258 |
|
|
| 1259 |
EXTRACTING CAPTURED SUBSTRINGS BY NUMBER |
EXTRACTING CAPTURED SUBSTRINGS BY NUMBER |
| 1260 |
|
|
| 1291 |
returned zero, indicating that it ran out of space in ovec- |
returned zero, indicating that it ran out of space in ovec- |
| 1292 |
tor, the value passed as stringcount should be the size of |
tor, the value passed as stringcount should be the size of |
| 1293 |
the vector divided by three. |
the vector divided by three. |
|
|
|
| 1294 |
The functions pcre_copy_substring() and pcre_get_substring() |
The functions pcre_copy_substring() and pcre_get_substring() |
| 1295 |
extract a single substring, whose number is given as string- |
extract a single substring, whose number is given as string- |
| 1296 |
number. A value of zero extracts the substring that matched |
number. A value of zero extracts the substring that matched |
| 1387 |
succeeds, they then call pcre_copy_substring() or |
succeeds, they then call pcre_copy_substring() or |
| 1388 |
pcre_get_substring(), as appropriate. |
pcre_get_substring(), as appropriate. |
| 1389 |
|
|
| 1390 |
Last updated: 03 February 2003 |
Last updated: 20 August 2003 |
| 1391 |
Copyright (c) 1997-2003 University of Cambridge. |
Copyright (c) 1997-2003 University of Cambridge. |
| 1392 |
----------------------------------------------------------------------------- |
----------------------------------------------------------------------------- |
| 1393 |
|
|
| 1455 |
The current_position field contains the offset within the |
The current_position field contains the offset within the |
| 1456 |
subject of the current match pointer. |
subject of the current match pointer. |
| 1457 |
|
|
| 1458 |
The capture_top field contains the number of the highest |
The capture_top field contains one more than the number of |
| 1459 |
captured substring so far. |
the highest numbered captured substring so far. If no sub- |
| 1460 |
|
strings have been captured, the value of capture_top is one. |
| 1461 |
|
|
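Because capture_top is one more than the highest captured group number, code that walks the captures inside a callout has to stop at capture_top - 1. The sketch below illustrates that bound. The function name count_set_groups is hypothetical, and it assumes the usual PCRE convention that the offset vector holds one (start, end) pair per group, with -1 marking a group that has not been set.

```c
/* Hypothetical helper, not part of the PCRE API.
   offset_vector: pairs of (start, end) offsets, pair n for group n.
   capture_top:   one more than the highest numbered captured group,
                  so a value of 1 means nothing has been captured yet.
   Returns how many of groups 1 .. capture_top-1 are currently set. */
int count_set_groups(const int *offset_vector, int capture_top)
{
    int n;
    int count = 0;

    for (n = 1; n < capture_top; n++) {
        /* Assumption: an unset group has -1 as its start offset. */
        if (offset_vector[2 * n] >= 0)
            count++;
    }
    return count;
}
```

With capture_top equal to one the loop body never runs and the result is zero, which matches the statement that a value of one means no substrings have been captured.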
| 1462 |
The capture_last field contains the number of the most |
The capture_last field contains the number of the most |
| 1463 |
recently captured substring. |
recently captured substring. |