| 436 |
Otherwise, if compilation of a pattern fails, \fBpcre_compile()\fP returns |
Otherwise, if compilation of a pattern fails, \fBpcre_compile()\fP returns |
| 437 |
NULL, and sets the variable pointed to by \fIerrptr\fP to point to a textual |
NULL, and sets the variable pointed to by \fIerrptr\fP to point to a textual |
| 438 |
error message. This is a static string that is part of the library. You must |
error message. This is a static string that is part of the library. You must |
| 439 |
not try to free it. The offset from the start of the pattern to the byte that |
not try to free it. Normally, the offset from the start of the pattern to the |
| 440 |
was being processed when the error was discovered is placed in the variable |
byte that was being processed when the error was discovered is placed in the |
| 441 |
pointed to by \fIerroffset\fP, which must not be NULL. If it is, an immediate |
variable pointed to by \fIerroffset\fP, which must not be NULL (if it is, an |
| 442 |
error is given. Some errors are not detected until checks are carried out when |
immediate error is given). However, for an invalid UTF-8 string, the offset is |
| 443 |
the whole pattern has been scanned; in this case the offset is set to the end |
that of the first byte of the failing character. Also, some errors are not |
| 444 |
of the pattern. |
detected until checks are carried out when the whole pattern has been scanned; |
| 445 |
|
in these cases the offset passed back is the length of the pattern. |
| 446 |
.P |
.P |
| 447 |
Note that the offset is in bytes, not characters, even in UTF-8 mode. It may |
Note that the offset is in bytes, not characters, even in UTF-8 mode. It may |
| 448 |
point into the middle of a UTF-8 character (for example, when |
sometimes point into the middle of a UTF-8 character. |
|
PCRE_ERROR_BADUTF8 is returned for an invalid UTF-8 string). |
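Because the offset counts bytes, a caller that wants to report a character position to a user has to convert it. The following is a minimal sketch of one way to do that. The helper name \fButf8_chars_before()\fP is hypothetical and is not part of PCRE, and the sketch assumes that the bytes before the offset are valid UTF-8.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of PCRE: count the UTF-8 characters that
   start before byte_offset. Continuation bytes (binary 10xxxxxx) are
   skipped, so only the first byte of each character is counted. Assumes
   the bytes before byte_offset are valid UTF-8. */
size_t utf8_chars_before(const char *pattern, size_t byte_offset)
{
    size_t count = 0;
    size_t i;

    assert(pattern != NULL);
    for (i = 0; i < byte_offset; i++) {
        unsigned char c = (unsigned char)pattern[i];
        if ((c & 0xc0) != 0x80)
            count++;
    }
    return count;
}
```

If the offset is the first byte of a character, the result is that character's zero-based index. If the offset points into the middle of a character, that character has already been counted, so the result is one greater than its index.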
|
| 449 |
.P |
.P |
| 450 |
If \fBpcre_compile2()\fP is used instead of \fBpcre_compile()\fP, and the |
If \fBpcre_compile2()\fP is used instead of \fBpcre_compile()\fP, and the |
| 451 |
\fIerrorcodeptr\fP argument is not NULL, a non-zero error code number is |
\fIerrorcodeptr\fP argument is not NULL, a non-zero error code number is |
| 552 |
ignored. This is equivalent to Perl's /x option, and it can be changed within a |
ignored. This is equivalent to Perl's /x option, and it can be changed within a |
| 553 |
pattern by a (?x) option setting. |
pattern by a (?x) option setting. |
| 554 |
.P |
.P |
| 555 |
Which characters are interpreted as newlines |
Which characters are interpreted as newlines is controlled by the options |
| 556 |
is controlled by the options passed to \fBpcre_compile()\fP or by a special |
passed to \fBpcre_compile()\fP or by a special sequence at the start of the |
| 557 |
sequence at the start of the pattern, as described in the section entitled |
pattern, as described in the section entitled |
| 558 |
.\" HTML <a href="pcrepattern.html#newlines"> |
.\" HTML <a href="pcrepattern.html#newlines"> |
| 559 |
.\" </a> |
.\" </a> |
| 560 |
"Newline conventions" |
"Newline conventions" |
| 946 |
below in the section on matching a pattern. |
below in the section on matching a pattern. |
| 947 |
. |
. |
| 948 |
. |
. |
| 949 |
|
.\" HTML <a name="infoaboutpattern"></a> |
| 950 |
.SH "INFORMATION ABOUT A PATTERN" |
.SH "INFORMATION ABOUT A PATTERN" |
| 951 |
.rs |
.rs |
| 952 |
.sp |
.sp |
| 1548 |
.\" |
.\" |
| 1549 |
page. If an invalid UTF-8 sequence of bytes is found, \fBpcre_exec()\fP returns |
page. If an invalid UTF-8 sequence of bytes is found, \fBpcre_exec()\fP returns |
| 1550 |
the error PCRE_ERROR_BADUTF8 or, if PCRE_PARTIAL_HARD is set and the problem is |
the error PCRE_ERROR_BADUTF8 or, if PCRE_PARTIAL_HARD is set and the problem is |
| 1551 |
a truncated UTF-8 character at the end of the subject, PCRE_ERROR_SHORTUTF8. If |
a truncated UTF-8 character at the end of the subject, PCRE_ERROR_SHORTUTF8. In |
| 1552 |
\fIstartoffset\fP contains a value that does not point to the start of a UTF-8 |
both cases, information about the precise nature of the error may also be |
| 1553 |
character (or to the end of the subject), PCRE_ERROR_BADUTF8_OFFSET is |
returned (see the descriptions of these errors in the section entitled \fIError |
| 1554 |
|
return values from\fP \fBpcre_exec()\fP |
| 1555 |
|
.\" HTML <a href="#errorlist"> |
| 1556 |
|
.\" </a> |
| 1557 |
|
below). |
| 1558 |
|
.\" |
| 1559 |
|
If \fIstartoffset\fP contains a value that does not point to the start of a |
| 1560 |
|
UTF-8 character (or to the end of the subject), PCRE_ERROR_BADUTF8_OFFSET is |
| 1561 |
returned. |
returned. |
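A caller can avoid PCRE_ERROR_BADUTF8_OFFSET by testing the offset before calling \fBpcre_exec()\fP. The following is a sketch only: the helper name \fButf8_offset_is_boundary()\fP is hypothetical, it uses size_t where \fBpcre_exec()\fP takes int arguments, and it assumes the subject itself is valid UTF-8. It is not PCRE's own check.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of PCRE: return 1 if offset is the end of
   the subject or the first byte of a character, 0 otherwise. A byte of the
   form binary 10xxxxxx is a continuation byte and so cannot start a
   character. Assumes the subject is valid UTF-8. */
int utf8_offset_is_boundary(const char *subject, size_t length, size_t offset)
{
    assert(subject != NULL);
    if (offset > length)
        return 0;                 /* beyond the end of the subject */
    if (offset == length)
        return 1;                 /* the end of the subject is allowed */
    return ((unsigned char)subject[offset] & 0xc0) != 0x80;
}
```

For the three-byte subject 0xc3, 0xa9, 0x78, offsets 0, 2, and 3 are accepted, and offset 1 (the second byte of the two-byte character) is rejected.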
| 1562 |
.P |
.P |
| 1563 |
If you already know that your subject is valid, and you want to skip these |
If you already know that your subject is valid, and you want to skip these |
| 1725 |
Some convenience functions are provided for extracting the captured substrings |
Some convenience functions are provided for extracting the captured substrings |
| 1726 |
as separate strings. These are described below. |
as separate strings. These are described below. |
| 1727 |
. |
. |
| 1728 |
|
. |
| 1729 |
.\" HTML <a name="errorlist"></a> |
.\" HTML <a name="errorlist"></a> |
| 1730 |
.SS "Error return values from \fBpcre_exec()\fP" |
.SS "Error return values from \fBpcre_exec()\fP" |
| 1731 |
.rs |
.rs |
| 1795 |
.sp |
.sp |
| 1796 |
PCRE_ERROR_BADUTF8 (-10) |
PCRE_ERROR_BADUTF8 (-10) |
| 1797 |
.sp |
.sp |
| 1798 |
A string that contains an invalid UTF-8 byte sequence was passed as a subject. |
A string that contains an invalid UTF-8 byte sequence was passed as a subject, |
| 1799 |
However, if PCRE_PARTIAL_HARD is set and the problem is a truncated UTF-8 |
and the PCRE_NO_UTF8_CHECK option was not set. If the size of the output vector |
| 1800 |
character at the end of the subject, PCRE_ERROR_SHORTUTF8 is used instead. |
(\fIovecsize\fP) is at least 2, the byte offset to the start of the invalid |
| 1801 |
|
UTF-8 character is placed in the first element, and a reason code is placed in |
| 1802 |
|
the second element. The reason codes are listed in the |
| 1803 |
|
.\" HTML <a href="#badutf8reasons"> |
| 1804 |
|
.\" </a> |
| 1805 |
|
following section. |
| 1806 |
|
.\" |
| 1807 |
|
For backward compatibility, if PCRE_PARTIAL_HARD is set and the problem is a |
| 1808 |
|
truncated UTF-8 character at the end of the subject (reason codes 1 to 5), |
| 1809 |
|
PCRE_ERROR_SHORTUTF8 is returned instead of PCRE_ERROR_BADUTF8. |
| 1810 |
.sp |
.sp |
| 1811 |
PCRE_ERROR_BADUTF8_OFFSET (-11) |
PCRE_ERROR_BADUTF8_OFFSET (-11) |
| 1812 |
.sp |
.sp |
| 1813 |
The UTF-8 byte sequence that was passed as a subject was valid, but the value |
The UTF-8 byte sequence that was passed as a subject was checked and found to |
| 1814 |
of \fIstartoffset\fP did not point to the beginning of a UTF-8 character or the |
be valid (the PCRE_NO_UTF8_CHECK option was not set), but the value of |
| 1815 |
|
\fIstartoffset\fP did not point to the beginning of a UTF-8 character or the |
| 1816 |
end of the subject. |
end of the subject. |
| 1817 |
.sp |
.sp |
| 1818 |
PCRE_ERROR_PARTIAL (-12) |
PCRE_ERROR_PARTIAL (-12) |
| 1856 |
.sp |
.sp |
| 1857 |
PCRE_ERROR_SHORTUTF8 (-25) |
PCRE_ERROR_SHORTUTF8 (-25) |
| 1858 |
.sp |
.sp |
| 1859 |
The subject string ended with an incomplete (truncated) UTF-8 character, and |
This error is returned instead of PCRE_ERROR_BADUTF8 when the subject string |
| 1860 |
the PCRE_PARTIAL_HARD option was set. Without this option, PCRE_ERROR_BADUTF8 |
ends with a truncated UTF-8 character and the PCRE_PARTIAL_HARD option is set. |
| 1861 |
is returned in this situation. |
Information about the failure is returned as for PCRE_ERROR_BADUTF8. It is in |
| 1862 |
|
fact sufficient to detect this case, but this special error code for |
| 1863 |
|
PCRE_PARTIAL_HARD predates the implementation of returned information; it is |
| 1864 |
|
retained for backwards compatibility. |
| 1865 |
|
.sp |
| 1866 |
|
PCRE_ERROR_RECURSELOOP (-26) |
| 1867 |
|
.sp |
| 1868 |
|
This error is returned when \fBpcre_exec()\fP detects a recursion loop within |
| 1869 |
|
the pattern. Specifically, it means that either the whole pattern or a |
| 1870 |
|
subpattern has been called recursively for the second time at the same position |
| 1871 |
|
in the subject string. Some simple patterns that might do this are detected and |
| 1872 |
|
faulted at compile time, but more complicated cases, in particular mutual |
| 1873 |
|
recursions between two different subpatterns, cannot be detected until run |
| 1874 |
|
time. |
| 1875 |
.P |
.P |
| 1876 |
Error numbers -16 to -20 and -22 are not used by \fBpcre_exec()\fP. |
Error numbers -16 to -20 and -22 are not used by \fBpcre_exec()\fP. |
| 1877 |
. |
. |
| 1878 |
. |
. |
| 1879 |
|
.\" HTML <a name="badutf8reasons"></a> |
| 1880 |
|
.SS "Reason codes for invalid UTF-8 strings" |
| 1881 |
|
.rs |
| 1882 |
|
.sp |
| 1883 |
|
When \fBpcre_exec()\fP returns either PCRE_ERROR_BADUTF8 or |
| 1884 |
|
PCRE_ERROR_SHORTUTF8, and the size of the output vector (\fIovecsize\fP) is at |
| 1885 |
|
least 2, the offset of the start of the invalid UTF-8 character is placed in |
| 1886 |
|
the first output vector element (\fIovector[0]\fP) and a reason code is placed |
| 1887 |
|
in the second element (\fIovector[1]\fP). The reason codes are given names in |
| 1888 |
|
the \fBpcre.h\fP header file: |
| 1889 |
|
.sp |
| 1890 |
|
PCRE_UTF8_ERR1 |
| 1891 |
|
PCRE_UTF8_ERR2 |
| 1892 |
|
PCRE_UTF8_ERR3 |
| 1893 |
|
PCRE_UTF8_ERR4 |
| 1894 |
|
PCRE_UTF8_ERR5 |
| 1895 |
|
.sp |
| 1896 |
|
The string ends with a truncated UTF-8 character; the code specifies how many |
| 1897 |
|
bytes are missing (1 to 5). Although RFC 3629 restricts UTF-8 characters to be |
| 1898 |
|
no longer than 4 bytes, the encoding scheme (originally defined by RFC 2279) |
| 1899 |
|
allows for up to 6 bytes, and this is checked first; hence the possibility of |
| 1900 |
|
4 or 5 missing bytes. |
| 1901 |
|
.sp |
| 1902 |
|
PCRE_UTF8_ERR6 |
| 1903 |
|
PCRE_UTF8_ERR7 |
| 1904 |
|
PCRE_UTF8_ERR8 |
| 1905 |
|
PCRE_UTF8_ERR9 |
| 1906 |
|
PCRE_UTF8_ERR10 |
| 1907 |
|
.sp |
| 1908 |
|
The two most significant bits of the 2nd, 3rd, 4th, 5th, or 6th byte of the |
| 1909 |
|
character do not have the binary value 0b10 (that is, either the most |
| 1910 |
|
significant bit is 0, or the next bit is 1). |
| 1911 |
|
.sp |
| 1912 |
|
PCRE_UTF8_ERR11 |
| 1913 |
|
PCRE_UTF8_ERR12 |
| 1914 |
|
.sp |
| 1915 |
|
A character that is valid by the RFC 2279 rules is either 5 or 6 bytes long; |
| 1916 |
|
these code points are excluded by RFC 3629. |
| 1917 |
|
.sp |
| 1918 |
|
PCRE_UTF8_ERR13 |
| 1919 |
|
.sp |
| 1920 |
|
A 4-byte character has a value greater than 0x10ffff; these code points are |
| 1921 |
|
excluded by RFC 3629. |
| 1922 |
|
.sp |
| 1923 |
|
PCRE_UTF8_ERR14 |
| 1924 |
|
.sp |
| 1925 |
|
A 3-byte character has a value in the range 0xd800 to 0xdfff; this range of |
| 1926 |
|
code points is reserved by RFC 3629 for use with UTF-16, and so is excluded |
| 1927 |
|
from UTF-8. |
| 1928 |
|
.sp |
| 1929 |
|
PCRE_UTF8_ERR15 |
| 1930 |
|
PCRE_UTF8_ERR16 |
| 1931 |
|
PCRE_UTF8_ERR17 |
| 1932 |
|
PCRE_UTF8_ERR18 |
| 1933 |
|
PCRE_UTF8_ERR19 |
| 1934 |
|
.sp |
| 1935 |
|
A 2-, 3-, 4-, 5-, or 6-byte character is "overlong", that is, it codes for a |
| 1936 |
|
value that can be represented by fewer bytes, which is invalid. For example, |
| 1937 |
|
the two bytes 0xc0, 0xae give the value 0x2e, whose correct coding uses just |
| 1938 |
|
one byte. |
| 1939 |
|
.sp |
| 1940 |
|
PCRE_UTF8_ERR20 |
| 1941 |
|
.sp |
| 1942 |
|
The two most significant bits of the first byte of a character have the binary |
| 1943 |
|
value 0b10 (that is, the most significant bit is 1 and the second is 0). Such a |
| 1944 |
|
byte can only validly occur as the second or subsequent byte of a multi-byte |
| 1945 |
|
character. |
| 1946 |
|
.sp |
| 1947 |
|
PCRE_UTF8_ERR21 |
| 1948 |
|
.sp |
| 1949 |
|
The first byte of a character has the value 0xfe or 0xff. These values can |
| 1950 |
|
never occur in a valid UTF-8 string. |
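The classification above can be sketched as a function that examines one character. This is an illustration only and is not PCRE's implementation. The function name \fButf8_reason()\fP is hypothetical, it returns plain integers from 1 to 21 rather than the PCRE_UTF8_ERR names, and it assumes both that the codes in each group are numbered in the order listed and that overlong encodings are tested before the other value checks.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative only, NOT PCRE's code. Examines the character starting at
   s[0], with avail bytes remaining in the string. Returns 0 if it is
   valid, otherwise a number 1..21 meant to mirror the groups above. */
int utf8_reason(const unsigned char *s, size_t avail)
{
    /* Smallest value that needs 1..5 continuation bytes (index = extra). */
    static const unsigned long min_value[] =
        { 0, 0x80, 0x800, 0x10000, 0x200000, 0x4000000 };
    unsigned int c, extra, i;
    unsigned long value;

    assert(s != NULL && avail > 0);
    c = s[0];
    if (c < 0x80) return 0;              /* one-byte character */
    if ((c & 0xc0) == 0x80) return 20;   /* continuation byte as first byte */
    if (c >= 0xfe) return 21;            /* 0xfe and 0xff are never valid */

    /* Number of continuation bytes implied by the first byte. */
    if (c < 0xe0) extra = 1;
    else if (c < 0xf0) extra = 2;
    else if (c < 0xf8) extra = 3;
    else if (c < 0xfc) extra = 4;
    else extra = 5;

    /* Truncated character: the code is the number of missing bytes. */
    if (avail < extra + 1) return (int)(extra + 1 - avail);

    /* Collect the value; codes 6..10 flag a bad 2nd..6th byte. */
    value = c & (0x7fu >> (extra + 1));
    for (i = 1; i <= extra; i++) {
        if ((s[i] & 0xc0) != 0x80) return (int)(5 + i);
        value = (value << 6) | (s[i] & 0x3fu);
    }

    /* Overlong: codes 15..19 for 2- to 6-byte characters. */
    if (value < min_value[extra]) return (int)(14 + extra);
    /* 5- and 6-byte characters are excluded by RFC 3629. */
    if (extra >= 4) return (int)(7 + extra);
    if (value > 0x10ffff) return 13;
    if (value >= 0xd800 && value <= 0xdfff) return 14;
    return 0;
}
```

With these assumptions, the overlong pair 0xc0, 0xae from the example above yields 15, and the three-byte sequence 0xed, 0xa0, 0x80 (the value 0xd800) yields 14.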
| 1951 |
|
. |
| 1952 |
|
. |
| 1953 |
.SH "EXTRACTING CAPTURED SUBSTRINGS BY NUMBER" |
.SH "EXTRACTING CAPTURED SUBSTRINGS BY NUMBER" |
| 1954 |
.rs |
.rs |
| 1955 |
.sp |
.sp |
| 2145 |
has run, they point to the first and last entries in the name-to-number table |
has run, they point to the first and last entries in the name-to-number table |
| 2146 |
for the given name. The function itself returns the length of each entry, or |
for the given name. The function itself returns the length of each entry, or |
| 2147 |
PCRE_ERROR_NOSUBSTRING (-7) if there are none. The format of the table is |
PCRE_ERROR_NOSUBSTRING (-7) if there are none. The format of the table is |
| 2148 |
described above in the section entitled \fIInformation about a pattern\fP. |
described in the section entitled \fIInformation about a pattern\fP |
| 2149 |
|
.\" HTML <a href="#infoaboutpattern"> |
| 2150 |
|
.\" </a> |
| 2151 |
|
above. |
| 2152 |
|
.\" |
| 2153 |
Given all the relevant entries for the name, you can extract each of their |
Given all the relevant entries for the name, you can extract each of their |
| 2154 |
numbers, and hence the captured data, if any. |
numbers, and hence the captured data, if any. |
| 2155 |
. |
. |
| 2279 |
.\" |
.\" |
| 2280 |
documentation. |
documentation. |
| 2281 |
. |
. |
| 2282 |
|
. |
| 2283 |
.SS "Successful returns from \fBpcre_dfa_exec()\fP" |
.SS "Successful returns from \fBpcre_dfa_exec()\fP" |
| 2284 |
.rs |
.rs |
| 2285 |
.sp |
.sp |
| 2313 |
\fIovector\fP, the yield of the function is zero, and the vector is filled with |
\fIovector\fP, the yield of the function is zero, and the vector is filled with |
| 2314 |
the longest matches. |
the longest matches. |
| 2315 |
. |
. |
| 2316 |
|
. |
| 2317 |
.SS "Error returns from \fBpcre_dfa_exec()\fP" |
.SS "Error returns from \fBpcre_dfa_exec()\fP" |
| 2318 |
.rs |
.rs |
| 2319 |
.sp |
.sp |
| 2379 |
.rs |
.rs |
| 2380 |
.sp |
.sp |
| 2381 |
.nf |
.nf |
| 2382 |
Last updated: 21 November 2010 |
Last updated: 28 July 2011 |
| 2383 |
Copyright (c) 1997-2010 University of Cambridge. |
Copyright (c) 1997-2011 University of Cambridge. |
| 2384 |
.fi |
.fi |